Preface



Welcome to the proceedings of the 8th European Conference on Computer Vi- 
sion! 

Following a very successful ECCV 2002, the response to our call for papers 
was almost equally strong - 555 papers were submitted. We accepted 41 papers 
for oral and 149 papers for poster presentation. 

Several innovations were introduced into the review process. First, the num- 
ber of program committee members was increased to reduce their review load. 
We managed to assign no more than 12 papers to each program committee member.
Second, we adopted a paper ranking system. Program committee members were 
asked to rank all the papers assigned to them, even those that were reviewed 
by additional reviewers. Third, we allowed authors to respond to the reviews 
consolidated in a discussion involving the area chair and the reviewers. Fourth, 
the reports, the reviews, and the responses were made available to the authors as 
well as to the program committee members. Our aim was to provide the authors 
with maximal feedback and to let the program committee members know how 
authors reacted to their reviews and how their reviews were or were not reflected 
in the final decision. Finally, we reduced the length of reviewed papers from 15 
to 12 pages. 

The preparation of ECCV 2004 went smoothly thanks to the efforts of the or- 
ganizing committee, the area chairs, the program committee, and the reviewers. 
We are indebted to Anders Heyden, Mads Nielsen, and Henrik J. Nielsen for 
passing on ECCV traditions and to Dominique Asselineau from ENST/TSI who 
kindly provided his GestRFIA conference software. We thank Jan-Olof Eklundh 
and Andrew Zisserman for encouraging us to organize ECCV 2004 in Prague. 
Andrew Zisserman also contributed many useful ideas concerning the organiza- 
tion of the review process. Olivier Faugeras represented the ECCV Board and 
helped us with the selection of conference topics. Kyros Kutulakos provided
helpful information about the CVPR 2003 organization. David Vernon helped to
secure EC Vision support. 

This conference would never have happened without the support of the 
Centre for Machine Perception of the Czech Technical University in Prague. 
We would like to thank Radim Sara for his help with the review process and 
the proceedings organization. We thank Daniel Vecerka and Martin Matousek 
who made numerous improvements to the conference software. Petr Pohl helped 
to put the proceedings together. Martina Budosova helped with administrative 
tasks. Hynek Bakstein, Ondrej Chum, Jana Kostkova, Branislav Micusik, Stepan
Obdrzalek, Jan Sochman, and Vit Zyka helped with the organization.
Abstract. Data association (obtaining correspondences) is a ubiquitous
problem in computer vision. It appears when matching image features 
across multiple images, matching image features to object recognition 
models and matching image features to semantic concepts. In this pa- 
per, we show how a wide class of data association tasks arising in com- 
puter vision can be interpreted as a constrained semi-supervised learn- 
ing problem. This interpretation opens up room for the development of 
new, more efficient data association methods. In particular, it leads to 
the formulation of a new principled probabilistic model for constrained 
semi-supervised learning that accounts for uncertainty in the parameters 
and missing data. By adopting an ingenious data augmentation strategy, 
it becomes possible to develop an efficient MCMC algorithm where the 
high-dimensional variables in the model can be sampled efficiently and di- 
rectly from their posterior distributions. We demonstrate the new model 
and algorithm on synthetic data and the complex problem of matching 
image features to words in the image captions. 



1 Introduction 

Data association is a ubiquitous problem in computer vision. It manifests itself
when matching images (e.g. stereo and motion data [1]), matching image features
to object recognition models [2] and matching image features to language
descriptions [3]. The data association task is commonly mapped to an unsupervised
probabilistic mixture model [4,1,5]. The parameters of this model are typically 
learned with the EM algorithm or approximate variants. This approach is fraught 
with difficulties. EM often gets stuck in local minima and is highly dependent on 
the initial values of the parameters. Markov chain Monte Carlo (MCMC) meth- 
ods also perform poorly in this mixture model scenario [6]. The reason for this 
failure is that the number of modes in the posterior distribution of the
parameters is factorial in the number of mixture components [7]. Maximisation in
such a highly peaked space is a formidable task and likely to fail in high dimensions.
This is unfortunate as it is becoming clear that effective learning techniques for 
computer vision have to manage many mixture components and high dimensions. 
Here, we take a new route to solve this vision problem. We cast the data asso- 
ciation problem as one of constrained semi-supervised learning. We argue that 
it is possible to construct efficient MCMC algorithms in this new setting.
Efficiency here is a result of using a data augmentation method, first introduced
in econometrics by economics Nobel laureate Daniel McFadden [8], which enables
us to compute the distribution of the high-dimensional variables analytically.
That is, instead of sampling in high dimensions with a Markov chain, we sample
directly from the posterior distribution of the high-dimensional variables. This
so-called Rao-Blackwellised sampler achieves an important decrease in variance,
as predicted by well-known theorems from Markov chain theory [9].

Our approach is similar in spirit to the multiple instance learning paradigm 
of Dietterich et al. [10]. This approach is expanded in [11] where the authors
adopt support vector machines to deal with the supervised part of the model 
and integer programming constraints to handle the missing labels. This optimi- 
sation approach suffers from two problems. First, it is NP-hard so one has to 
introduce heuristics. Second, it is an optimisation technique and as such it only 
gives us a point estimate of the decision boundary. That is, it lacks a proba- 
bilistic interpretation. The approach we propose here allows us to compute all 
probabilities of interest and consequently we are able to obtain not only point 
estimates, but also confidence measures. These measures are essential when the 
data association mechanism is embedded in a meta decision problem, as is often 
the case. 

The problem of semi-supervised learning has received great attention in the 
recent machine learning literature. In particular, very efficient kernel methods 
have been proposed to attack this problem [12,13]. Our approach, still based on 
kernel expansions, favours sparse solutions. Moreover, it does not require super- 
vised samples from each category and, in addition, it is probabilistic. The most 
important point is that our approach allows for the introduction of constraints. 
Adding constraints to existing algorithms for semi-supervised learning leads to 
NP-hard problems, typically of the integer programming type as in [11]. 

We introduce a coherent, fully probabilistic Bayesian model for constrained 
semi-supervised learning. This enables us to account for uncertainty in both 
the parameters and unknown labels in a principled manner. The model applies 
to both regression and classification, but we focus on the problem of binary 
classification so as to demonstrate the method in the difficult task of matching 
image regions to words in the image caption [3]. 

Our contribution is therefore threefold: a new approach to a known com- 
plex data association (correspondence) problem, a general principled probabilis- 
tic model for constrained semi-supervised learning and a sophisticated blocked 
MCMC algorithm to carry out the necessary computations. 

2 Data Association as Constrained Semi-supervised 
Learning 

There are many large collections of annotated images on the web, in galleries
and at news agencies. Figure 1 shows a few annotated images from the Corel image
database. By, for example, segmenting the images, we can view object recognition
as the process of finding the correct associations between the labels in the
caption and the image segments. Knowing the associations allows us to build
a translation model that takes as input image features and outputs the
appropriate words; see [3] for a detailed description. A properly trained
translation model takes images (without any captions) as input and outputs
images with labelled regions.

Fig. 1. Annotated images from the Corel database. We would like to automatically
match image regions to words in the caption. That is, we do not know the right
associations (correspondences) between image features and text features.

What makes this approach feasible is that the training set of images like 
the leftmost three images in Figure 1 is vast and ever increasing. On the other 
hand, a supervised approach using training data like the rightmost image, where
segments have been annotated, is very problematic in practice, as labelling in- 
dividual segments (or other local image features) is hard and time-consuming. 

This data association problem can be formulated as a mixture model similar 
to the ones used in statistical machine translation. This is the approach originally 
proposed in [3] and extended in [14] to handle continuous image features. The 
parameters in both cases were learned with EM. The problem with this approach 
is that the posterior over parameters of the mixture model has a factorial number 
of modes and so EM tends to get stuck in local minima. The situation is no better 
for MCMC algorithms for mixture models because of this factorial explosion of 
modes [6]. This calls for a new approach. 

We can convert the data association problem to a constrained semi-supervised 
learning problem. We demonstrate this with the toy example of Figure 1. Suppose 
we are interested in being able to detect boats in images. We could assume that 
if the word boat does not appear in the caption, then there are no boats in
the image (see footnote 1). In this case, we assign the label 0 to each segment
in the image. If
however the word boat appears in the caption, then we know that at least one of 
the segments corresponds to a boat. The problem is that we do not know which. 
So we assign question marks to the labels in this image. Sometimes, we might be 
fortunate and have a few segment labels as in the rightmost image of Figure 1. 

By letting $x_i$ denote the feature vector corresponding to the i-th segment
and $y_i$ denote the existence of a boat, our data association problem is mapped
to the following semi-supervised binary classification task:

            image 1            image 2            image 3                    image 4
Input x     x_1   x_2   x_3    x_4   x_5   x_6    x_7   x_8   x_9   x_10     x_11  x_12
Labels y    ?     ?     ?      0     0     0      0     0     0     0        0     1

Footnote 1: Of course, this depends on how good the labels are, but as mentioned
earlier, there are many databases with very good captions; see for example
www.corbis.com. So for now we work under this assumption.



Note that for the question marks, we have the constraint that at least one of 
them has to be a 1 (this is what leads to the integer programming problem in 
optimisation approaches). To be able to annotate all the image segments, we 
need to build one classifier for each word of interest. This is sound from an 
information retrieval point of view [11]. From an object recognition perspective, 
we would like to adopt multicategorical classifiers. Here, we opt for a simple 
solution by combining the responses of the various binary classifiers [15]. 
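The labelling rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `segment_labels` and `constraint_holds` are hypothetical, `None` stands in for a question mark, and captions are assumed to be reliable word lists.

```python
from typing import List, Optional, Sequence

Label = Optional[int]  # 0, 1, or None for an unknown label ("?")


def segment_labels(caption: Sequence[str], n_segments: int, word: str) -> List[Label]:
    """Label every segment of one image for the binary classifier of `word`."""
    if word not in caption:
        # Word absent from the caption: all segments are known negatives.
        return [0] * n_segments
    # Word present: at least one segment is positive, but we do not know which.
    return [None] * n_segments


def constraint_holds(labels: Sequence[Label], assignment: Sequence[int]) -> bool:
    """Check a candidate 0/1 assignment against the labels of one image."""
    # The assignment must agree with every known label.
    if any(l is not None and l != a for l, a in zip(labels, assignment)):
        return False
    # If the image has question marks, at least one of them must be set to 1.
    if any(l is None for l in labels):
        return any(a == 1 for l, a in zip(labels, assignment) if l is None)
    return True
```

For example, an image captioned with "boat" and split into three segments yields three question marks, and the all-zero assignment is rejected by `constraint_holds`.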

In more precise terms, given the training data $\mathcal{D}$ (a collection of
images with captions), the goal is then to learn the predictive distribution
$p(y = 1 \mid x)$, where $y$ is a binary indicator variable that is 1 iff the new
test-set image segment represented by $x$ is part of the concept. If we use a
model with parameters $\theta$, the Bayesian solution is given by

$$p(y = 1 \mid x) = \int p(y = 1 \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta.$$

That is, we integrate out the uncertainty of the parameters. The problem with
this theoretical solution is that the integral is intractable. To overcome this
problem, we sample $\theta$ according to $p(\theta \mid \mathcal{D})$ to obtain
the following approximation

$$p(y = 1 \mid x) \approx \frac{1}{T} \sum_{i=1}^{T} p(y = 1 \mid x, \theta_i),$$

where $\theta_i$ is one of the $T$ samples. This approximation converges to the
true solution by the Strong Law of Large Numbers. This approach not only allows
us to compute point estimates, but also confidence intervals. In the next
section, we outline the probabilistic model.
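The sample average above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `predictive` and the `likelihood` callback are assumptions, and the returned spread is simply the standard deviation of the per-sample probabilities, one crude way of attaching a confidence measure to the point estimate.

```python
from statistics import mean, stdev
from typing import Callable, Sequence, Tuple, TypeVar

Theta = TypeVar("Theta")


def predictive(
    x: object,
    samples: Sequence[Theta],
    likelihood: Callable[[object, Theta], float],
) -> Tuple[float, float]:
    """Monte Carlo estimate of p(y = 1 | x) from posterior samples.

    `likelihood(x, theta)` must return p(y = 1 | x, theta) for one sample.
    Returns the sample average and the spread of the per-sample values.
    """
    probs = [likelihood(x, theta) for theta in samples]
    spread = stdev(probs) if len(probs) > 1 else 0.0
    return mean(probs), spread
```

With more posterior samples the average stabilises, while the spread indicates how strongly the individual samples disagree about the label of `x`.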

3 Parametrization and Probabilistic Model 

Our training data $\mathcal{D}$ consists of two parts, the set of blob description
vectors $\{x_{1:N}\}$ with $x_i \in \mathbb{R}^d$ for $i = 1, \ldots, N$ and a set
of binary labels $y^k$. The full set of labels includes the known and unknown
labels, $y = \{y^k, y^u\}$. Our classification model is as follows

$$\Pr(y_i = 1 \mid x_i, \beta, \gamma) = \Phi(f(x_i, \beta, \gamma)), \tag{1}$$

where $\Phi(u) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{u} \exp(-a^2/2)\, da$ is the
cumulative function of the standard Normal distribution. This is the so-called
probit link. By convention, researchers tend to adopt the logistic link function
$\varphi(u) = (1 + \exp(-u))^{-1}$. However, from a Bayesian computational point
of view, the probit link has many advantages and is equally valid. Following Tam,
Doucet and Kotagiri [16], the unknown function is represented with a sparse
kernel machine with kernels centered at the data points $x_{1:N}$:

$$f(x, \beta, \gamma) = \beta_0 + \sum_{i=1}^{N} \gamma_i \beta_i K(x, x_i). \tag{2}$$

Here $\beta$ is an N-dimensional parameter vector and $K$ is a kernel function.
Typical choices for the kernel function $K$ are:

- Linear: $K(x_i, x) = \|x_i - x\|$

- Cubic: $K(x_i, x) = \|x_i - x\|^3$

- Gaussian: $K(x_i, x) = \exp(-\lambda \|x_i - x\|^2)$

- Sigmoidal: $K(x_i, x) = \tanh(\lambda \|x_i - x\|^2)$

The last two kernels require a scale parameter $\lambda$ to be chosen. The vector
of unknown binary indicator variables $\gamma \in \{0, 1\}^N$ is used to control
the complexity of the model. It leads to sparser solutions and updates, where the
subset of active kernels adapts to the data. This is a well studied statistical
model [16].
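To make the sparse kernel expansion of equation (2) and the probit link of equation (1) concrete, here is a small sketch. The function names and the default scale `lam=1.0` are illustrative assumptions, and only the Gaussian and sigmoidal kernels from the list above are shown.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def gaussian_kernel(xi: Vector, x: Vector, lam: float = 1.0) -> float:
    return math.exp(-lam * sq_dist(xi, x))


def sigmoidal_kernel(xi: Vector, x: Vector, lam: float = 1.0) -> float:
    return math.tanh(lam * sq_dist(xi, x))


def probit(u: float) -> float:
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def f(
    x: Vector,
    centres: Sequence[Vector],
    beta0: float,
    beta: Sequence[float],
    gamma: Sequence[int],
    kernel: Callable[[Vector, Vector], float],
) -> float:
    """Intercept plus the kernels switched on by gamma, as in equation (2)."""
    return beta0 + sum(
        g * b * kernel(x, c) for g, b, c in zip(gamma, beta, centres)
    )
```

Setting an entry of `gamma` to 0 removes the corresponding kernel from the sum, which is how the indicator vector controls model complexity; `probit(f(...))` then gives the class probability.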

When all the kernels are active, we can express equation (2) in matrix notation

$$f(x_i, \beta) = \Psi_i \beta,$$

where $\Psi_i$ denotes the i-th row of the kernel matrix

$$\Psi = \begin{bmatrix} 1 & K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_N) \\ 1 & K(x_2, x_1) & K(x_2, x_2) & \cdots & K(x_2, x_N) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & K(x_N, x_1) & K(x_N, x_2) & \cdots & K(x_N, x_N) \end{bmatrix}. \tag{3}$$

When only a subset of kernels is active, we obtain a sparse model:

$$f(x_i, \beta, \gamma) = \Psi_{\gamma i} \beta_\gamma,$$

where $\Psi_\gamma$ is the matrix consisting of the columns $j$ of $\Psi$ where
$\gamma_j = 1$. $\Psi_{\gamma i}$ then is the i-th row of this matrix. $\beta_\gamma$
is the reduced version of $\beta$, only containing the coefficients for the
activated kernels. In [16], this model is applied to supervised learning and
shown to produce more accurate results than support vector machines and other
kernel machines. Here, we need to extend the model to the more general scenario
of semi-supervised learning with constraints in the labels.
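The kernel matrix and its sparse, column-reduced version can be sketched as below. This is an illustration only: the helper names are hypothetical, matrices are plain lists of rows, and keeping the leading column of ones (the intercept) regardless of the indicator vector is an assumption made here for the sketch.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Matrix = List[List[float]]


def linear_kernel(xi: Vector, x: Vector) -> float:
    """Linear kernel from the list above: the Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, x)))


def design_matrix(xs: Sequence[Vector], kernel: Callable[[Vector, Vector], float]) -> Matrix:
    """Full kernel matrix: a column of ones followed by K(x_i, x_j)."""
    return [[1.0] + [kernel(xi, xj) for xj in xs] for xi in xs]


def select_active(psi: Matrix, gamma: Sequence[int]) -> Matrix:
    """Keep the intercept column and the kernel columns j with gamma_j = 1."""
    cols = [0] + [j + 1 for j, g in enumerate(gamma) if g == 1]
    return [[row[c] for c in cols] for row in psi]


def fitted_values(psi_gamma: Matrix, beta_gamma: Sequence[float]) -> List[float]:
    """Row-wise products of the reduced matrix with the reduced coefficients."""
    return [sum(p * b for p, b in zip(row, beta_gamma)) for row in psi_gamma]
```

Because only the active columns enter the product, the cost of evaluating the model grows with the number of active kernels rather than with the number of data points.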

We adopt a hierarchical Bayesian model [17]. We assume that each kernel is
active with probability $\tau$, i.e. $p(\gamma \mid \tau)$ is a Bernoulli
distribution. Instead of having the user choose a fixed $\tau$ a priori, we deal
with this parameter in the Bayesian way and assign a prior $p(\tau)$ to it. This
way, the value of $\tau$ is allowed to adapt to the data. At the same time we can
bias it by specifying the prior $p(\tau)$ according to our prior belief as to what
the value of $\tau$ should be. While Tam, Doucet and Kotagiri [16] use the
completely uninformative uniform prior, we instead choose to put a conjugate
Beta prior on $\tau$, which allows the user to exert as much control as desired
over the percentage of active kernels:

$$p(\tau) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \tau^{a-1} (1 - \tau)^{b-1}. \tag{4}$$

For the choice $a = b = 1.0$, we get the uninformative uniform distribution. We
obtain the prior on the binary vector $\gamma$ by integrating over $\tau$:

$$p(\gamma) = \int p(\gamma \mid \tau)\, p(\tau)\, d\tau = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \frac{\Gamma(\Sigma_\gamma + a)\, \Gamma(N - \Sigma_\gamma + b)}{\Gamma(N + a + b)}, \tag{5}$$

where $\Sigma_\gamma$ is the number of active kernels, i.e. the number of non-zero
elements in $\gamma$.
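The marginal prior on the indicator vector is easy to evaluate on the log scale. The sketch below assumes the standard Beta-Bernoulli marginal, including the normalising ratio of Gamma functions from the Beta prior; the function name `log_prior_gamma` is hypothetical.

```python
from math import lgamma
from typing import Sequence


def log_prior_gamma(gamma: Sequence[int], a: float = 1.0, b: float = 1.0) -> float:
    """Log of the Beta-Bernoulli marginal prior over a binary indicator vector.

    `a` and `b` are the Beta hyper-parameters; a = b = 1 is the uniform prior.
    """
    n = len(gamma)
    active = sum(gamma)  # number of active kernels
    return (
        lgamma(a + b) - lgamma(a) - lgamma(b)
        + lgamma(active + a) + lgamma(n - active + b)
        - lgamma(n + a + b)
    )
```

A quick sanity check is that the prior sums to one over all binary vectors of a given length; with `a = b = 1` and two kernels, the empty configuration has probability 1/3.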

A (maximum entropy) g-prior is placed on the coefficients $\beta$:

$$p(\beta \mid \delta^2) = \mathcal{N}\left(0, \delta^2 (\Psi_\gamma^T \Psi_\gamma)^{-1}\right), \tag{6}$$

where the regularisation parameter is assigned an inverse gamma prior:

$$p(\delta^2) = \mathcal{IG}\left(\frac{\mu}{2}, \frac{\nu}{2}\right). \tag{7}$$

This prior has two parameters $\mu$ and $\nu$ that have to be specified by the
user. One could argue that this is worse than the single parameter $\delta^2$.
However, the parameters of this hyper-prior have a much less direct influence
than $\delta^2$ itself and are therefore less critical for the performance of the
algorithms [17]. Assigning small values to these parameters results in an
uninformative prior and allows $\delta^2$ to adapt to the data.



3.1 Augmented Model 

We augment the probabilistic model artificially in order to obtain an analytical
expression for the posterior of the high-dimensional variables $\beta$. In
particular, we introduce the set of independent variables $z_i \in \mathbb{R}$,
such that

$$z_i = f(x_i, \beta, \gamma) + n_i, \tag{8}$$

where $n_i \sim \mathcal{N}(0, 1)$. The set of augmentation variables consists of
two subsets $z = \{z^k, z^u\}$, one corresponding to the known labels $y^k$ and the
other to the unknown labels $y^u$. For the labelled data, we have

$$p(z_i^k \mid \beta, \gamma, x_i) = \mathcal{N}(f(x_i, \beta, \gamma), 1) = \mathcal{N}(\Psi_{\gamma i} \beta_\gamma, 1). \tag{9}$$

We furthermore define

$$y_i^k = \begin{cases} 1 & \text{if } z_i^k > 0, \\ 0 & \text{otherwise.} \end{cases}$$
It is then easy to check that one has the required result:

$$\Pr(y_i^k = 1 \mid x_i, \beta, \gamma) = \Pr(z_i^k > 0 \mid x_i, \beta, \gamma) = \Pr(n_i > -\Psi_{\gamma i} \beta_\gamma) = \Phi(\Psi_{\gamma i} \beta_\gamma).$$
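The identity above can be checked numerically: adding standard Normal noise to a fixed mean and thresholding at zero should give a positive label with the probit probability of that mean. The sketch below is a simulation under that reading; the function names and the fixed seed are illustrative choices.

```python
import math
import random


def probit(u: float) -> float:
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def simulate_label_frequency(mean: float, n_draws: int, seed: int = 0) -> float:
    """Fraction of draws z = mean + n, with n ~ N(0, 1), that satisfy z > 0."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_draws) if mean + rng.gauss(0.0, 1.0) > 0.0)
    return hits / n_draws
```

For a mean of 0.5, the simulated frequency should be close to `probit(0.5)`, which is roughly 0.69.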

Now, let $y^u_{k:k+l}$ denote the set of missing labels for a particular image (a
set of question marks as described in Section 2). The prior distribution for the
corresponding augmentation variables $z^u_{k:k+l}$ is then:

$$p(z^u_{k:k+l} \mid \beta, \gamma, x) \propto \left[ \prod_{j=k}^{k+l} \mathcal{N}(\Psi_{\gamma j} \beta_\gamma, 1) \right] I_C(z^u_{k:k+l}), \tag{10}$$

where $I_\Omega(\omega)$ is the set indicator function: 1 if $\omega \in \Omega$ and
0 otherwise. Our particular set of constraints is
$C = \{\text{one or more } z_j^u > 0\}$. That is, one or more of the $z_j^u$ must
be positive so that at least one of the $y^u$ is positive. This prior is a
truncated Normal distribution with the negative orthant missing. The
hierarchical Bayesian model is summarised in Figure 2.
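One simple way to draw from a Normal distribution with the negative orthant removed is rejection sampling: draw all components independently and retry until at least one is positive. This is a deliberately naive stand-in, not the specialised routine the paper relies on; the function name and the `max_tries` guard are assumptions.

```python
import random
from typing import List, Sequence


def sample_constrained_z(
    means: Sequence[float], rng: random.Random, max_tries: int = 10000
) -> List[float]:
    """Draw unit-variance Normal variables subject to 'at least one is positive'.

    `means` holds the model outputs for the segments of one image whose
    labels are all question marks.
    """
    for _ in range(max_tries):
        z = [m + rng.gauss(0.0, 1.0) for m in means]
        if any(v > 0.0 for v in z):
            return z  # constraint satisfied: accept the draw
    raise RuntimeError("constraint region has very low probability; use a better sampler")
```

Rejection works well when the constraint is likely to hold, but it becomes slow when all means are strongly negative, which is one motivation for more specialised truncated-Gaussian samplers.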




Fig. 2. Our directed acyclic graphical model. Note that by conditioning on z, y is 
independent of the model parameters. 



3.2 Posterior Distribution 

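The posterior distribution follows from Bayes rule

$$p(\beta, \gamma, \delta^2, z \mid y^k, x_{1:N}) \propto p(y^k \mid z^k)\, p(\gamma)\, p(\beta \mid \delta^2)\, p(\delta^2)\, p(z^u \mid \beta, \gamma, x)\, p(z^k \mid \beta, \gamma, x).$$

The key thing to note, by looking at our graphical model, is that by conditioning
on the 1-dimensional variables $z$, the model reduces to a standard
linear-Gaussian model [17]. We can as a result obtain analytical expressions for
the conditional posteriors of the high-dimensional variables $\beta$ and the
regularisation parameter $\delta^2$:

$$p(\beta \mid z, x, \gamma, \delta^2) = \mathcal{N}\left( \frac{\delta^2}{1 + \delta^2} (\Psi_\gamma^T \Psi_\gamma)^{-1} \Psi_\gamma^T z,\; \frac{\delta^2}{1 + \delta^2} (\Psi_\gamma^T \Psi_\gamma)^{-1} \right), \tag{11}$$

$$p(\delta^2 \mid z, \beta, \gamma) = \mathcal{IG}\left( \frac{\mu + \Sigma_\gamma + 1}{2},\; \frac{\nu + \beta_\gamma^T \Psi_\gamma^T \Psi_\gamma \beta_\gamma}{2} \right), \tag{12}$$

where $z$ is the vector $(z_1, z_2, \ldots, z_N)^T$. The posterior distribution of
the augmentation variables $z^k$ is given by the following truncated Normal
distributions:

$$p(z_i^k \mid \beta, \gamma, x_i, y_i^k) \propto p(y_i^k \mid z_i^k)\, p(z_i^k \mid x_i, \beta, \gamma) = \begin{cases} \mathcal{N}(\Psi_{\gamma i} \beta_\gamma, 1)\, I_{(0,+\infty)}(z_i^k) & \text{if } y_i^k = 1, \\ \mathcal{N}(\Psi_{\gamma i} \beta_\gamma, 1)\, I_{(-\infty,0]}(z_i^k) & \text{if } y_i^k = 0. \end{cases} \tag{13}$$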
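For a single labelled point, the truncated Normal of equation (13) can be sampled with the inverse-CDF method. This is a generic textbook technique shown for illustration, not the routine used in the paper; the function name is hypothetical and the clamping constant is an arbitrary numerical safeguard.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # zero mean, unit variance


def sample_z_known(mean: float, y: int, rng: random.Random) -> float:
    """Draw z ~ N(mean, 1) restricted to z > 0 if y == 1, else to z <= 0."""
    p_nonpositive = _STD_NORMAL.cdf(-mean)  # P(z <= 0) before truncation
    if y == 1:
        u = rng.uniform(p_nonpositive, 1.0)
    else:
        u = rng.uniform(0.0, p_nonpositive)
    # inv_cdf is undefined at exactly 0 and 1, so keep u strictly inside.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return mean + _STD_NORMAL.inv_cdf(u)
```

The method is exact for moderate means; when the mean lies far from the allowed side, the clamping distorts the draw and a tail-specific sampler is the safer choice.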



4 MCMC Computation 

We need to sample from the posterior distribution $p(\theta \mid \mathcal{D})$,
where $\theta$ represents the full set of parameters. To accomplish this, we
introduce a Metropolised blocked Gibbs sampler. In short, we sample the
high-dimensional parameters $\beta$ and the regularisation parameter $\delta^2$
directly from their posterior distributions (equations (11) and (12)). It is
important to note that only the components of $\beta$ associated with the active
kernels need to be updated. This computation is therefore very efficient. The
$\gamma$ are sampled with the efficient MCMC algorithm described in detail in
[16]. The $z^u$ are sampled from the truncated multivariate Gaussian in equation
(10), while the $z^k$ are sampled from the truncated distributions given by
equation (13).
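As one concrete piece of such a sampler, the regularisation parameter can be drawn directly from an inverse gamma conditional by inverting a Gamma draw. The sketch below assumes the shape/scale parameterisation of the inverse gamma and a shape of (mu + number of active coefficients) / 2 with scale (nu + squared norm of the fitted values) / 2; these specifics, like the function name, are assumptions of this illustration rather than a statement of the authors' code.

```python
import random
from typing import Sequence


def sample_delta2(
    psi_gamma: Sequence[Sequence[float]],
    beta_gamma: Sequence[float],
    mu: float,
    nu: float,
    rng: random.Random,
) -> float:
    """One Gibbs draw of the regularisation parameter from its conditional."""
    # Fitted values of the reduced model; their squared norm is the
    # quadratic form of beta with the Gram matrix of the active columns.
    fitted = [sum(p * b for p, b in zip(row, beta_gamma)) for row in psi_gamma]
    quad = sum(v * v for v in fitted)
    shape = (mu + len(beta_gamma)) / 2.0
    scale = (nu + quad) / 2.0
    # If G ~ Gamma(shape, scale=1/s) then 1/G ~ InverseGamma(shape, s).
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)
```

Because the draw comes straight from the conditional distribution, no accept/reject step is needed for this block of the sampler.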

To sample from the truncated Gaussian distributions, we use the specialised
routines described in [18]. These routines, based on results from large deviation
theory, are essential in order to achieve good acceptance rates. We found in our
experiments that the acceptance rate was satisfactory (70% to 80%).

5 Experiments 

5.1 Synthetic Data 

In this first experiment we tested the performance of our algorithm on synthetic 
data. We sampled 300 data points from a mixture model consisting of a Gaus- 
sian and a surrounding ring with Gaussian cross section (see Figure 3(a)). Data 
points generated by the inner Gaussian were taken to be the positive instances, 
while those on the ring were assumed to be negative. The data points were then 
randomly grouped into groups (representing documents) of 6 data points each. 
In the given example, this resulted in 12 groups with exclusively negative data
points, and 38 groups with both positive and negative instances. This corresponds
to 72 data points with known negative labels and 228 data points with unknown but
constrained labels.
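A data set of this kind can be generated along the following lines. The mixture proportion, ring radius and noise levels below are illustrative guesses, not the values used in the experiment, so the resulting group counts will generally differ from those reported above.

```python
import math
import random
from typing import Dict, List


def make_bags(
    n_points: int = 300,
    bag_size: int = 6,
    p_positive: float = 0.2,   # assumed mixture weight of the inner Gaussian
    ring_radius: float = 5.0,  # assumed radius of the surrounding ring
    seed: int = 0,
) -> List[Dict[str, list]]:
    """Sample points, group them at random, and derive the observed labels."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_points):
        if rng.random() < p_positive:
            point = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))  # inner Gaussian
            label = 1
        else:
            angle = rng.uniform(0.0, 2.0 * math.pi)
            radius = ring_radius + rng.gauss(0.0, 0.5)  # Gaussian cross section
            point = (radius * math.cos(angle), radius * math.sin(angle))
            label = 0
        data.append((point, label))
    rng.shuffle(data)
    bags = []
    for start in range(0, n_points, bag_size):
        chunk = data[start:start + bag_size]
        hidden = [lab for _, lab in chunk]
        # Groups without positives are known negatives; otherwise every label
        # is unknown (None), subject to "at least one positive".
        observed = [0] * len(chunk) if sum(hidden) == 0 else [None] * len(chunk)
        bags.append({"points": [p for p, _ in chunk], "hidden": hidden, "observed": observed})
    return bags
```

The `hidden` labels are kept only for evaluation; a learner would see just `points` and `observed`.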

We ran our algorithm on this data for 2000 samples (after a burn-in period
of 1000 samples) using uninformative priors and a sigmoidal kernel with kernel
parameter $\lambda = 1.0$. Although no data points were explicitly known to be
positive in this case, the information of the constraints was sufficient to learn
a nice distribution $p(y = 1 \mid x)$, as shown in Figure 3(b). Using an
appropriate threshold produces a perfect classification in this example.
Fig. 3. Experiment with synthetic data. (a) The given data points: instances
with known negative labels are shown as filled circles, whereas data points with
unknown label are represented by the + symbol. (b) Surface plot of
$p(y = 1 \mid x)$, the probability distribution computed by our approach. The
distribution nicely separates positive and negative examples and thus provides
an excellent classifier.

Fig. 4. ROC plots (as in Figure 5) for the annotation 'sky': (a) performance on
training data, (b) performance on test data. The average performance of our
proposed approach is visualized by the solid line, that of the EM mixture
algorithm by the dashed line. The error bars represent the standard deviation
across 20 runs. It is clear from the plots that our proposed algorithm is more
reliable and stable.

5.2 Object Recognition Data Set

For this experiment, we used a set of 300 annotated images from the Corel
database. The images in this set were annotated with 38 different words in total,
and each image was segmented into regions using normalised cuts [19]. Each of
the regions is described by a 6-dimensional feature vector (CIE-Lab colour, y
position in the image, boundary to area ratio and standard deviation of
brightness). The data set was split into one training set containing 200 images
with 2070 image regions and a test set of 100 images with 998 regions.

We compared two learning methods in this experiment. The first consisted 
of a mixture of Gaussians translation model trained with EM [3,14]. The second 
is the method proposed in this paper. We adopted a vague hyper-prior for $\delta^2$
($\mu = \nu = 0.01$). Experiments with different types of kernels showed the sigmoidal
kernel to work best for this data set. Not only did it produce better classifiers 
than linear, multi-quadratic, cubic and Gaussian kernels, it also led to numeri- 
cally more stable and sparser samplers. The average number of activated kernels 
per sample was between 5 and 20, depending on the learned concept. 

We used both EM with the mixture model and our new constrained semi- 
supervised approach to learn binary classifiers for several of the words in this 
dataset. The Markov chains were run for 10,000 samples after a burn-in phase 
of 10,000 samples. On a 2.6 GHz Pentium 4, run times for this were in the range
of 5 to 10 minutes, which is perfectly acceptable in our setting. 

The performance of the learned classifiers was then evaluated by comparing 
their classification results for varying thresholds against a manual annotation of 
the individual image regions. The ROC plots in Figure 5 show the results aver- 
aged over 20 runs, plotting true positives against false positives. The plots show 
that the approach proposed in this paper yields significantly better classification 
performance than the EM mixture method. Given the relatively simple features
used and the small size of the data set, the performance is remarkably good. 
Figure 5 shows that the classifiers learned using the proposed approach
generalize fairly well even where the EM mixture approach fails due to overfitting
(look at the results for the concept 'space' for an example).
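The threshold sweep behind such ROC plots can be sketched as follows. The function name `roc_points` is hypothetical, and the sketch assumes that a region is predicted positive when its score is at least the threshold.

```python
from typing import List, Sequence, Tuple


def roc_points(
    scores: Sequence[float], labels: Sequence[int], thresholds: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) for each threshold.

    `scores` are predicted probabilities p(y = 1 | x); `labels` are the
    manual 0/1 annotations of the same regions.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    points = []
    for t in thresholds:
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        points.append((fp / negatives, tp / positives))
    return points
```

Averaging the curves of several runs, and recording their standard deviation at each threshold, gives plots of the kind discussed here.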

Figure 4 illustrates the dramatically higher consistency across runs of the 
algorithm proposed in this paper as compared to the EM algorithm for the 
mixture model. The error bars indicate the standard deviation of the ROC plots 
across the 20 runs. The large amount of variation indicates that EM got stuck
in local minima on several runs. While with the Corel data set this problem arose 
only for some of the categories, in larger and higher dimensional data sets, local 
minima are known to become a huge problem for EM. 
Fig. 5. ROC plots measuring the classification performance on image regions from
the Corel image dataset of both the proposed algorithm (solid line) and the EM
mixture algorithm (dashed line), averaged over 20 runs. The x axis measures the
ratio of false positives to negatives, while the y axis corresponds to the ratio
of true positives to positives. The plots are generated by using the learned
probabilistic classifiers with varying thresholds and allow the classifiers to be
compared independently of a chosen fixed threshold value. The performance on the
test set is remarkable considering that the algorithm only has access to simple
image features (and no text in any form).
Abstract. This paper focuses on how to perform the unsupervised clustering
of tree structures in an information theoretic setting. We pose the
problem of clustering as that of locating a series of archetypes that can 
be used to represent the variations in tree structure present in the train- 
ing sample. The archetypes are tree-unions that are formed by merging 
sets of sample trees, and are attributed with probabilities that measure 
the node frequency or weight in the training sample. The approach is de- 
signed to operate when the correspondences between nodes are unknown 
and must be inferred as part of the learning process. We show how the 
tree merging process can be posed as the minimisation of an information 
theoretic minimum description length criterion. We illustrate the utility
of the resulting algorithm on the problem of classifying 2D shapes using 
a shock graph representation. 



1 Introduction 

Graph-based representations have been used with considerable success in com- 
puter vision in the abstraction and recognition of object shape and scene struc- 
ture. Concrete examples include the use of shock graphs to represent shape- 
skeletons [10,15], the use of trees to represent articulated objects [8,19] and the 
use of aspect graphs for 3D object representation [2]. The attractive feature of 
structural representations is that they concisely capture the relational arrange- 
ment of object primitives, in a manner which can be invariant to changes in 
object viewpoint. However, despite the many advantages and attractive features 
of graph representations, the methodology available for learning structural rep- 
resentations from sets of training examples is relatively limited. As a result, the 
process of constructing shape-spaces which capture the modes of structural vari- 
ation for sets of graphs has proved to be elusive. Hence, geometric representations 
of shape such as point distribution models [6] , have proved to be more amenable 
when variable sets of shapes must be analyzed. There are two reasons why pat- 
tern spaces are more easily constructed for curves and surfaces than for graphs. 
First, there is no canonical ordering for the nodes or edges of a graph. Hence, 
before a vector-space can be constructed, correspondences between nodes
must be established. Second, structural variations in graphs manifest themselves
as differences in the numbers of nodes and edges. As a result, even if a vector
mapping can be established, the vectors will be of variable length.

One way of circumventing this problem is to embed the graphs in a low dimen- 
sional space using the distances between graphs or by using simple graph features 
that do not require correspondence analysis. For instance, Cyr and Kimia have 
used a geometric procedure to embed graphs on a view-sphere [1]. Demirci and
Dickinson [9] have shown how the minimum distortion embedding procedure 
of Linial, London and Rabinovich [11] can be used for the purposes of corre- 
spondence matching. A recent review of methods that could be used to perform 
the embedding process is provided in the paper of Hjaltason and Samet [7]. 
However, although this work provides a means of capturing the distribution of 
graphs and can be used for clustering, it does not provide an embedding which 
allows a generative model of detailed graph structure to be learned. In other 
words, the distribution does not capture in an explicit manner the variations in 
the graphs in terms of changes in node and edge structure. Recently, though, 
there has been considerable interest in learning structural representations from 
samples of training data, in particular in the context of Bayesian networks [5,3], 
mixtures of tree-classifiers [12], or general relational models [4]. Unfortunately, 
these methods require the availability of node correspondences as a prerequisite. 

The aim in this paper is to develop an information theoretic framework for 
the unsupervised learning of generative models of tree-structures from sets of 
examples. We pose the problem as that of learning a mixture of union trees. 
Each tree union is an archetype that represents a class of trees. Those trees that 
belong to a particular class can be obtained from the relevant tree archetype by 
node removal operations. Hence, the union-tree can be formed using a sequence 
of tree merge operations. We work under conditions in which the node corre- 
spondences required to perform merges are unknown and must be located by 
minimising tree edit distance. Associated with each node of the union structure 
is a probability. This is a random variable which represents the frequency of 
the node in the training sample. Since every tree in the sample can be obtained 
from one of the union structures in the mixture, the tree archetypes are genera- 
tive models. There are three quantities that must be estimated to construct this 
generative model. The first of these are the correspondences between the nodes 
in the training examples and the estimated union structure. Secondly, there is 
the union structure itself. Finally, there are the node probabilities. We cast the 
estimation of these three quantities in an information theoretic setting using 
the description length for the union structure and its associated node proba- 
bilities given correspondences with the set of training examples [13]. With the 
tree-unions to hand, we can then use PCA to project the trees into a low 
dimensional vector space. 

2 Generative Tree Model 

Consider the set or sample of trees $V = \{t_1, t_2, \ldots, t_n\}$. Our aim in this paper is 
to cluster these trees, i.e. to perform unsupervised learning of the class structure 
of the sample. We pose this problem as that of learning a mixture of generative 
class archetypes. Each class archetype is constructed by merging sets of sample 
trees together to form a set of union structures. This merge process requires 
node correspondence information, and we work under conditions in which these 
are unknown and must be inferred as part of the learning process. Each tree 
in the sample can hence be obtained from one of the union-structures using a 
sequence of node removal operations. Thus the class archetypes are generative 
models since they capture in an explicit manner the structural variations for the 
sample trees belonging to a particular class in a probabilistic manner. 

Suppose that the set of class archetypes constituting the mixture model is 
denoted by $\mathcal{H} = \{T_1, T_2, \ldots, T_k\}$. For the class $c$, the tree model $T_c$ is a 
structural archetype derived from the tree-union obtained by merging the set of 
trees $V_c \subset V$ constituting the class. Associated with the archetype is a 
probability distribution which captures the variations in tree structure within 
the class. Hence, the learning process involves estimating the union structure 
and the parameters of the associated probability distribution for the class 
model $T_c$. As a prerequisite, we require the set of node correspondences $C$ 
between sample trees and the union tree for each class. 

Our aim is to cast the learning process into an information theoretic setting. 
The estimation of the required class models is effected using a simple greedy 
optimization method. The quantity to be optimized is the descriptor length for 
the sample data-set V. The parameters to be optimized include the structural 
archetype of the model T as well as the node correspondences C between the 
samples in the set V and the archetype. Hence, the inter-sample node corre- 
spondences are not assumed to be known a priori. Since the correspondences 
are uncertain, we must solve two interdependent optimization problems. These 
are the optimization of the union structure given a set of correspondences, and 
the optimization of the correspondences given the tree structure. These dual 
optimization steps are approximated by greedily merging similar tree-models. 

We characterize uncertainties in the structure obtained by tree merge opera- 
tions by assigning probabilities to nodes. By adopting an information theoretic 
approach we demonstrate that the tree-edit distance, and hence the costs for the 
edit operations used to merge trees, are related to the entropies associated with 
the node probabilities. 

2.1 Probabilistic Framework 

More formally, the basis of the proposed structural learning approach is a gener- 
ative tree model which allows us to assign a probability distribution to a sample 
of hierarchical trees. Each hierarchical tree $t$ is defined by a set of nodes 
$\mathcal{N}^t$, a tree-order relation $O^t \subset \mathcal{N}^t \times \mathcal{N}^t$ between the nodes, and, in 
the case of weighted trees, a weight set $W^t = \{w_i^t \mid i \in \mathcal{N}^t\}$ where $w_i^t$ is the 
weight associated with node $i$ of tree $t$. A tree-order relation $O^t$ is an order 
relation with the added constraint that if $(x,y) \in O^t$ and $(z,y) \in O^t$, then 
either $(x,z) \in O^t$ or $(z,x) \in O^t$. A node $b$ is said to be a descendent of $a$, 
or $a \leadsto b$, if $(a,b) \in O^t$. Furthermore, if $b$ is a descendent of $a$ then it is 
also a child of $a$ if there is no node $x$ such that $a \leadsto x$ and $x \leadsto b$, that 
is there is no node between $a$ and $b$ in the tree-order. 

Our aim is to construct a generative model for a class of trees $V_c \subset V$. 
The structural component of this model $T_c$ consists of a set of nodes $\mathcal{N}_c$ and 
an associated tree order relation $O_c \subset \mathcal{N}_c \times \mathcal{N}_c$. Additionally, there is a set 
$\Theta_c = \{\theta_i^c \mid i \in \mathcal{N}_c\}$ of sampling probabilities $\theta_i^c$ for each node $i \in \mathcal{N}_c$. Hence the 
model is the triple $T_c = (\mathcal{N}_c, O_c, \Theta_c)$. A sample from this model is a hierarchical 
tree $t = (\mathcal{N}^t, O^t)$ with node set $\mathcal{N}^t \subset \mathcal{N}_c$ and a node hierarchy $O^t$ that is the 
restriction to $\mathcal{N}^t$ of $O_c$. In other words, the sample tree is just a subtree of the 
class archetype, which can be obtained using a simple set of edit operations that 
prune the archetype. 

To develop our generative model we make a number of simplifying assumptions. 
First, we drop the class index $c$ to simplify notation. Second, we assume 
that the set of nodes for the union structure $T$ spans each of the encountered 
sample trees in $V$, i.e. $\mathcal{N} = \bigcup_{t \in V} \mathcal{N}^t$. Third, we assume that the sampling 
error acts only on nodes, while the hierarchical relations are always sampled 
correctly. That is, if nodes $i$ and $j$ satisfy the relation $i\,O\,j$, then node $i$ 
will be an ancestor of node $j$ in each tree-sample that has both nodes. 

Our assumptions imply that two nodes will always satisfy the same hierar- 
chical relation whenever they are both present in a sample tree. A consequence 
of this assumption is that the structure of a sample tree is completely deter- 
mined by restricting the order relation of the model O to the nodes observed 
in the sample tree. Hence, the links in the sampled tree can be viewed as the 
minimal representation of the order relation between the nodes. The sampling 
process is equivalent to the application of a set of node removal operations to 
the archetypical structure $T = (\mathcal{N}, O, \Theta)$, which makes the archetype a union 
of the set of all possible tree samples. 
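
The sampling process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the union is stored as a hypothetical parent map, the function name `sample_tree` and the toy union are ours, and the root is given sampling probability 1 so that the result stays a single tree. Each retained node is attached to its nearest retained ancestor, which is what restricting the order relation to the observed nodes amounts to.

```python
import random

def sample_tree(parent, theta, rng):
    """Sample a tree from a union structure by node removal.

    parent: dict mapping each union node to its parent (the root maps to None).
    theta:  dict mapping each union node to its sampling probability.
    Returns the parent map of the sampled tree.
    """
    # Each node is kept independently with its own sampling probability.
    kept = {v for v in parent if rng.random() < theta[v]}
    sampled = {}
    for v in kept:
        # Walk up the union until a retained ancestor is found; this restricts
        # the union's order relation to the retained nodes.
        a = parent[v]
        while a is not None and a not in kept:
            a = parent[a]
        sampled[v] = a
    return sampled

# Toy union (hypothetical): node "a" is never sampled, all others always are.
union_parent = {"r": None, "a": "r", "b": "a", "c": "a", "d": "r"}
union_theta = {"r": 1.0, "a": 0.0, "b": 1.0, "c": 1.0, "d": 1.0}
tree = sample_tree(union_parent, union_theta, random.Random(0))
```

In the toy union, removing node `a` re-attaches its children `b` and `c` to the root, so the sampled tree is still consistent with the hierarchy of the archetype.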

To define a probability distribution over the union structure T, we require 
the correspondences between the nodes in each sample tree t and the nodes 
in the class-model $T$. We hence define a map $C : \mathcal{N}^t \to \mathcal{N}$ from the set $\mathcal{N}^t$ 
of the nodes of $t$, to the nodes of the class model $T$. The mapping induces a 
sample-correspondence for each node $i \in \mathcal{N}$. When the nodes of the sample 
trees have weights associated with them, then we would expect the sampling 
likelihood to reflect the distribution of weights. Hence, the simple probability 
distribution described above, which is based on uniform sample node probability, 
is not sufficient because it does not take into account the weight distribution. To 
overcome this shortcoming, in addition to the set of sampling probabilities $\Theta$, we 
associate with the union model a weight distribution function. Here we assume 
that the weight distribution is a rectified Gaussian. For the node i of the union 
tree the weight probability distribution is given by 



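\[
P(w_j \mid C(j) = i) =
\begin{cases}
\dfrac{1}{\theta_i \sigma_i \sqrt{2\pi}} \exp\left[-\dfrac{1}{2}\dfrac{(w_j - \mu_i)^2}{\sigma_i^2}\right] & \text{if } w_j \ge 0, \\
0 & \text{otherwise,}
\end{cases}
\]

where the weight distribution has mode $\mu_i$ and standard deviation $\sigma_i$. The 
sampling probability is the integral of the distribution over positive weights, i.e. 

\[
\theta_i = \int_0^\infty \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left[-\frac{1}{2}\frac{(w - \mu_i)^2}{\sigma_i^2}\right] dw = 1 - \mathrm{erfc}(\tau_i), \qquad (1)
\]

where $\tau_i = \mu_i/\sigma_i$ and erfc is the complementary error function. Taking into 
account the correspondences, the probability for node $i$ induced by the mapping is 

\[
\phi(i \mid t, T, C) =
\begin{cases}
\theta_i \, P(w_j \mid C(j) = i) & \text{if there exists } j \in \mathcal{N}^t \text{ such that } C(j) = i, \\
1 - \theta_i & \text{otherwise.}
\end{cases}
\]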
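
As a numerical illustration of the sampling probability and the node probability, the sketch below evaluates both for a single union node. It rests on two assumptions that are ours rather than the text's: first, that the erfc used here is the upper tail of a standard normal distribution (so that one minus it equals the integral of the Gaussian over positive weights), which in terms of Python's `math.erfc` is `0.5 * math.erfc(tau / sqrt(2))`; second, that the rectified Gaussian is normalised by the sampling probability, so that the product of the two reduces to a plain Gaussian density. The function names are hypothetical.

```python
import math

def paper_erfc(tau):
    # Assumption: the text's erfc(tau) is the standard normal upper tail,
    # i.e. 0.5 * math.erfc(tau / sqrt(2)), so that theta = 1 - erfc(tau).
    return 0.5 * math.erfc(tau / math.sqrt(2.0))

def sampling_probability(mu, sigma):
    """Probability mass of a Gaussian(mu, sigma) over positive weights."""
    return 1.0 - paper_erfc(mu / sigma)

def node_probability(weight, mu, sigma):
    """Probability for one union node.

    weight is the weight of the sample node mapped to this union node,
    or None when no sample node is mapped to it.
    """
    theta = sampling_probability(mu, sigma)
    if weight is None:
        return 1.0 - theta
    # Assumption: theta times the rectified density is the Gaussian density.
    z = (weight - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
```

With mode 0 and unit standard deviation, half of the Gaussian mass lies over positive weights, so `sampling_probability(0.0, 1.0)` evaluates to 0.5 under this convention.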



2.2 Estimating Node Parameters 

We can compute the log-likelihood of the sample data V given the tree-union 
model T and the correspondence mapping function C. Under the assumption 
that the sampling process acts independently on the nodes of the structure the 
log-likelihood is 



\[
\mathcal{L}(V \mid T, C) = \sum_{t \in V} \sum_{i \in \mathcal{N}} \ln \phi(i \mid t, T, C).
\]

Our ultimate aim is to optimize the log-likelihood with respect to the corre- 
spondence map C and the tree union model T. These variables, though, are not 
independent since they both depend on the node-set $\mathcal{N}$. A variation in the 
actual identity and number of the nodes does not change the log-likelihood. Hence 
the dependency on the node-set can be lifted by simply assuming that the node 
set is the image of the correspondence map i.e. Im(C). As we will see later, 
the reason for this is that those nodes that remain unmapped do not affect the 
maximization process. 

We defer details of how we estimate the correspondence map C and the order 
relation O to later sections of the paper. However, assuming estimates of them 
are to hand, then we can make maximum likelihood estimates of the selected 
node model. That is, the set of sampling probabilities $\Theta$ in the unweighted case, 
and the node parameters $\mu$ and $\sigma$ in the weighted case. 

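To proceed, let $K_i = \{j \in \mathcal{N}^t \mid t \in V, C(j) = i\}$ be the set of nodes in the 
different trees for which $C$ maps a node to $i$ and let $p_i = |K_i|$ be the number of 
trees satisfying this condition. Further, let $n_i$ be the number of trees in $V$ for 
which $C$ results in no mapping to the node $i$. Using the weighted node model, the 
log-likelihood function can be expressed as the sum of per-node log-likelihood 
functions 

\[
\mathcal{L}(V \mid T, C) = \sum_{i \in \mathcal{N}} \log\left[ \mathrm{erfc}(\tau_i)^{n_i} \, (2\pi\sigma_i^2)^{-p_i/2} \exp\left( -\frac{1}{2} \sum_{j \in K_i} \left( \frac{w_j}{\sigma_i} - \tau_i \right)^2 \right) \right]. \qquad (2)
\]

To estimate the parameters of the weight distribution, we take the derivatives 
of the log-likelihood function with respect to $\sigma_i$ and $\tau_i$ and set them to zero. 
When $n_i > 0$, we maximize the log likelihood by setting $\tau_i^{(0)} = \mathrm{erfc}^{-1}\left(\frac{n_i}{n_i + p_i}\right)$, 
and iterating the recurrence: 

\[
\sigma_i^{(k)} = -\frac{\tau_i^{(k)}}{2}\frac{W}{p_i} + \sqrt{\left(\frac{\tau_i^{(k)}}{2}\frac{W}{p_i}\right)^2 + \frac{W_2}{p_i}}, \qquad
\tau_i^{(k+1)} = \tau_i^{(k)} - \frac{f(\tau_i^{(k)}, \sigma_i^{(k)})}{\frac{\partial}{\partial \tau_i} f(\tau_i^{(k)}, \sigma_i^{(k)})}, \qquad (3)
\]

where $W = \sum_{j \in K_i} w_j$, $W_2 = \sum_{j \in K_i} (w_j)^2$, and 
$f(\tau_i, \sigma_i) = n_i \, \mathrm{erfc}'(\tau_i) + p_i \, \mathrm{erfc}(\tau_i) \left( \frac{W}{p_i \sigma_i} - \tau_i \right)$. 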
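
The estimation of the node parameters can be illustrated with a short Python sketch. It is not the authors' implementation: it assumes that the erfc of the text is the standard normal upper tail, that the update for the standard deviation is the positive root of the stationarity condition of the per-node log-likelihood, and that the update for the ratio of mode to standard deviation is a Newton step on the function f. All function names and the toy weights are hypothetical.

```python
import math
from statistics import NormalDist

def paper_erfc(tau):
    # Assumption: erfc in the text is the standard normal upper tail.
    return 0.5 * math.erfc(tau / math.sqrt(2.0))

def normal_density(tau):
    return math.exp(-0.5 * tau * tau) / math.sqrt(2.0 * math.pi)

def sigma_given_tau(tau, W, W2, p):
    # Positive root of p*sigma^2 + tau*W*sigma - W2 = 0.
    half = tau * W / (2.0 * p)
    return -half + math.sqrt(half * half + W2 / p)

def stationarity(tau, sigma, W, n, p):
    # f(tau, sigma) = n*erfc'(tau) + p*erfc(tau)*(W/(p*sigma) - tau),
    # using erfc'(tau) = -normal_density(tau).
    return -n * normal_density(tau) + p * paper_erfc(tau) * (W / (p * sigma) - tau)

def stationarity_dtau(tau, sigma, W, n, p):
    d = normal_density(tau)
    return n * tau * d - p * d * (W / (p * sigma) - tau) - p * paper_erfc(tau)

def fit_node(weights, n, iterations=200):
    """Estimate (tau, sigma) from the p observed weights and n unmatched trees."""
    p = len(weights)
    W = sum(weights)
    W2 = sum(w * w for w in weights)
    # Initial value: erfc(tau) = n / (n + p), inverted via the normal quantile.
    tau = -NormalDist().inv_cdf(n / (n + p))
    for _ in range(iterations):
        sigma = sigma_given_tau(tau, W, W2, p)
        tau -= stationarity(tau, sigma, W, n, p) / stationarity_dtau(tau, sigma, W, n, p)
    return tau, sigma_given_tau(tau, W, W2, p)

toy_weights = [1.0, 1.2, 0.8, 1.1, 0.9]
tau_hat, sigma_hat = fit_node(toy_weights, n=1)
```

The mode is then recovered as the product of the two estimates. The fixed number of iterations is a simplification; a convergence test on the change in the ratio would be the more usual stopping rule.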



3 Mixture Model 

We now commence our discussion of how to estimate the order relation O for the 
tree union T, and the set of correspondences C needed to merge the sample trees 
to form the tree-union. We pose the problem as that of fitting a mixture of tree 
unions to the set of sample trees. Each tree-union may be used to represent a 
distribution of trees that belong to a single class V c . The defining characteristic 
of the class is the fact that the nodes present in the sample trees satisfy a 
single order relation O c . However, the sample set V may have a more complex 
class structure and it may be necessary to describe it using multiple tree unions. 
Under these conditions the unsupervised learning process must allow for multiple 
classes. We represent the distribution of sample trees using a mixture model over 
separate union structures. Suppose that there are k tree-unions and that the tree 
union for the class $c$ is denoted by $T_c$ and that the mixing proportion for this 
tree-union is $\alpha_c$. The mixture model for the distribution of sample trees is 

\[
P(t \mid \mathcal{H}, C) = \sum_{c=1}^{k} \alpha_c \prod_{i \in \mathcal{N}_c} \phi(i \mid t, T_c, C).
\]

The expected log-likelihood function for the mixture model over the sample-set 
$V$ is: 

\[
\mathcal{L}(V \mid \mathcal{H}, C, z) = \sum_{t \in V} \sum_{c=1}^{k} z_c^t \sum_{i \in \mathcal{N}_c} \ln \phi(i \mid t, T_c, C),
\]

where $z_c^t$ is an indicator variable, that takes on the value 1 if tree $t$ belongs to 
the mixture component $c$, and is zero otherwise. 

We hence require an information criterion that can be used to select the set of 
tree merge operations over the sample set V that results in the optimal set of tree- 
unions. It is well known that the maximum likelihood criterion cannot be directly 
used to estimate the number of mixture components, since the maximum of the 
likelihood function is a monotonic function of the number of components. In 
order to overcome this problem we use the Minimum Description Length (MDL) 
principle [13], which asserts that the model that best describes a set of data is 
that which minimizes the combined cost of encoding the model, and, the error 
between the model and the data. The MDL principle allows us to select from a 
family of possibilities the most parsimonious model that best approximates the 
underlying data. 

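More formally, the expected descriptor length of a data set $V$ generated by 
an estimate $\mathcal{H}$ of the true or underlying model $\mathcal{H}^*$ is 

\[
\begin{aligned}
E[LL(V, \mathcal{H})] &= -\int P(V \mid \mathcal{H}^*) \log\left[ P(V \mid \mathcal{H}) P(\mathcal{H}) \right] dV \\
&= -\int P(V, \mathcal{H}^*) \log\left[ P(V, \mathcal{H}) \right] dV \\
&= -\int P(V, \mathcal{H}^*) \log\left( P(V, \mathcal{H}^*) \right) dV + \int P(V, \mathcal{H}^*) \log\left( \frac{P(V, \mathcal{H}^*)}{P(V, \mathcal{H})} \right) dV \\
&= I(P(V, \mathcal{H}^*)) + KL(P(V, \mathcal{H}^*), P(V, \mathcal{H})), \qquad (4)
\end{aligned}
\]

where 

\[
I(P(V, \mathcal{H}^*)) = -\int P(V, \mathcal{H}^*) \log\left( P(V, \mathcal{H}^*) \right) dV
\]

is the entropy of the joint probability of the data and the underlying model $\mathcal{H}^*$, 
and 

\[
KL(P(V, \mathcal{H}^*), P(V, \mathcal{H})) = -\int P(V, \mathcal{H}^*) \log\left( \frac{P(V, \mathcal{H})}{P(V, \mathcal{H}^*)} \right) dV
\]

is the Kullback-Leibler divergence between the joint probabilities using the 
underlying model $\mathcal{H}^*$ and the estimated model $\mathcal{H}$. This quantity is minimized when 
$\mathcal{H} = \mathcal{H}^*$, and hence $P(V, \mathcal{H}) = P(V, \mathcal{H}^*)$. 

Under these conditions $KL(P(V, \mathcal{H}^*), P(V, \mathcal{H})) = 0$ and $E[LL(V, \mathcal{H})] = I(P(V, \mathcal{H}))$. 
In other words, the description length associated with the maximum likelihood 
set of parameters is just the expected value of the negative log 
likelihood, i.e. the Shannon entropy. 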
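
The decomposition of the expected description length into an entropy term and a Kullback-Leibler term can be checked numerically on a small discrete example. The sketch below is only an illustration: it replaces the integrals by sums over a three-outcome distribution, and the distributions `p_true` and `p_model` are made-up numbers, not data from the paper.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def expected_code_length(p, q):
    """Expected negative log-probability under model q when data follow p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)

p_true = [0.5, 0.25, 0.25]   # stands in for the underlying model
p_model = [0.4, 0.4, 0.2]    # stands in for the estimated model

lhs = expected_code_length(p_true, p_model)
rhs = entropy(p_true) + kl_divergence(p_true, p_model)
```

Here `lhs` and `rhs` agree up to rounding, and the divergence term vanishes when the estimated distribution equals the underlying one, leaving only the entropy.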

As noted above, the cost incurred in describing or encoding the model $T$ 
is $-\log[P(T)]$, while the cost of describing the data $V$ using that model is 
$-\log[P(V \mid T)]$. Making the dependence on the correspondences $C$ explicit, we 
have that the description length is $LL(V \mid T) = -\mathcal{L}(V \mid T, C)$. Asymptotically 
the cost of describing the set of mixing components $\alpha = \{\alpha_c ;\ c = 1, \ldots, k\}$ and 
the set of indicator variables $z = \{z_c^t \mid t \in V, c = 1, \ldots, k\}$ is bounded by $mI(\alpha)$, 
where $m$ is the number of samples in $V$ and $I(\alpha) = -\sum_{c=1}^{k} \alpha_c \log(\alpha_c)$ is the 
entropy of the mixture distribution $\alpha$. We assume that the weight distribution 
is encoded as a histogram. Hence, we commence by dividing the weight space 
of the samples associated with the node $i$ of the union-tree $c$ into buckets of 
width $k\sigma_i^c$. As a result, the probability that a weight falls in a bucket centered 
at $x$ is, for infinitesimally small $k$, $b_i^c(x) = \frac{k}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x}{\sigma_i^c} - \tau_i^c\right)^2\right]$. Hence, the 
asymptotic cost of describing the node parameters $\tau_i^c$ and $\sigma_i^c$ and, at the same 
time, describing within the specified precision the $m\alpha_c$ samples associated to 
node $i$ in union $c$, is 

\[
LL_i^c(V \mid T_c, C) = -(m\alpha_c - p_i) \log(1 - \theta_i^c) - \sum_{j=1}^{p_i} \log\left( b_i^c(w_j) \right),
\]

where $\theta_i^c = 1 - \mathrm{erfc}(\tau_i^c)$ is the sampling probability for node $i$ and $p_i$ is the number 
of times the correspondence $C$ maps a sample-node to $i$. Hence $(m\alpha_c - p_i)$ is the 
number of times node $i$ has not been sampled according to the correspondence 
map $C$. As a result 

\[
LL(V \mid \mathcal{H}, C) = mI(\alpha) + \sum_{c=1}^{k} \sum_{i \in \mathcal{N}_c} \left[ LL_i^c(V \mid T_c, C) + l \right], \qquad (5)
\]

where $l$ is the description length per node of the tree-union structure, which we 
set to 1. 
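
To make the bookkeeping of the description length concrete, the following Python sketch adds up the three contributions described above: the mixing entropy term, the per-node cost, and the unit cost per node. It is a sketch under assumptions: the data layout (a list of dictionaries with keys `alpha`, `nodes`, `weights`, `tau`, `sigma`) and all function names are hypothetical, the erfc of the text is taken to be the standard normal upper tail, and the bucket probability includes the factor of the bucket width over the square root of two pi.

```python
import math

def paper_erfc(tau):
    # Assumption: erfc in the text is the standard normal upper tail.
    return 0.5 * math.erfc(tau / math.sqrt(2.0))

def mixing_entropy(alphas):
    return -sum(a * math.log(a) for a in alphas if a > 0.0)

def node_cost(weights, tau, sigma, m_c, k):
    """Cost of one union node: unmatched samples plus histogram-coded weights."""
    p = len(weights)
    # 1 - theta equals erfc(tau); each of the (m_c - p) misses costs -log of it.
    cost = -(m_c - p) * math.log(paper_erfc(tau))
    for w in weights:
        bucket = k / math.sqrt(2.0 * math.pi) * math.exp(-0.5 * (w / sigma - tau) ** 2)
        cost -= math.log(bucket)
    return cost

def description_length(unions, m, k, per_node=1.0):
    """Total description length for a mixture of unions over m samples."""
    total = m * mixing_entropy([u["alpha"] for u in unions])
    for u in unions:
        m_c = m * u["alpha"]  # samples assigned to this union
        for node in u["nodes"]:
            total += node_cost(node["weights"], node["tau"], node["sigma"], m_c, k)
            total += per_node
    return total
```

The bucket width parameter `k` only shifts the total by a constant per observed weight, so it does not affect which of two candidate merges over the same data is preferred.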

4 Learning the Mixture 

With the description length criterion to hand, our aim is to locate tree merges 
that give rise to the set of tree unions that optimally partition the training 
data V into non-overlapping classes. Unfortunately, locating the global minimum 
of the descriptor length in this way is an intractable combinatorial problem. 
Moreover, the Expectation-Maximization algorithm may not be used since the 
complexity of the maximization step grows exponentially due to the fact that 
the membership indicators admit the possibility that each union can potentially 
include every sample-tree. Hence, we resort to a local search technique, which 
allows us to limit the complexity of the maximization step. The approach is as 
follows. 

— Commence with an overly-specific model. We use a structural model per 
sample-tree, where each model is equiprobable and structurally identical to 
the respective sample-tree, and each node has unit sample probability. 

— Iteratively generalize the model by merging pairs of tree- unions. The candi- 
dates for merging are chosen so that they maximally decrease the descriptor 
length. 

— The algorithm stops when there are no merges remaining that can decrease 
the descriptor length. 
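
The three steps above amount to a greedy agglomerative loop, which can be sketched as follows. The sketch is deliberately generic: `merge` and `cost` are caller-supplied functions, and the toy versions used here (`merge_models`, `toy_cost`) are stand-ins invented for illustration. In particular, `toy_cost` counts union nodes plus missing nodes per sample under known node labels; it is not the description length of Eq. (5), and no correspondence search is performed.

```python
from itertools import combinations

def greedy_merge(models, merge, cost):
    """Repeatedly apply the pairwise merge that lowers the total cost the most."""
    models = list(models)
    while len(models) > 1:
        current = cost(models)
        best = None
        for a, b in combinations(range(len(models)), 2):
            candidate = [m for idx, m in enumerate(models) if idx not in (a, b)]
            candidate.append(merge(models[a], models[b]))
            c = cost(candidate)
            if c < current and (best is None or c < best[0]):
                best = (c, candidate)
        if best is None:
            break  # no merge decreases the cost
        models = best[1]
    return models

# Toy stand-ins: a model is the list of samples merged into it, and a sample
# is a set of node labels (so correspondences are given by the labels).
def merge_models(a, b):
    return a + b

def toy_cost(models):
    total = 0
    for samples in models:
        union = frozenset().union(*samples)
        total += len(union) + sum(len(union - s) for s in samples)
    return total

s1, s2 = frozenset("abc"), frozenset("ab")
s3, s4 = frozenset("xy"), frozenset("xyz")
clusters = greedy_merge([[s1], [s2], [s3], [s4]], merge_models, toy_cost)
```

Starting from one model per sample, the loop merges the two pairs of similar samples and then stops, because merging the two resulting models would increase the toy cost.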

The main requirement of our description length minimization algorithm is 
that we can optimally merge two tree models. Given two tree models $T_1$ and 
$T_2$, we wish to construct a union $T$ whose structure respects the hierarchical 
constraints present in both $T_1$ and $T_2$, and that also minimizes the quantity 
$LL(T)$. Since the trees $T_1$ and $T_2$ already assign node correspondences $C_1$ and 
$C_2$ from the data samples to the model, we can simply find a map $M$ from the 
nodes in $T_1$ and $T_2$ to $T$ and transitively extend the correspondences from the 
samples to the final model $T$ in such a way that, given two nodes $v \in \mathcal{N}_1$ and 
$v' \in \mathcal{N}_2$, then $C(v) = C(v') \Leftrightarrow v' = M(v)$. 

Posed as the merge of two structures, the correspondence problem is reduced 
to that of finding the set of nodes in $T_1$ and $T_2$ that are common to both trees. 
Starting with the two structures, we merge the sets of nodes that reduce the 
descriptor length by the largest amount, while still satisfying the hierarchical 
constraint. That is, we merge nodes $u$ and $v$ of $T_1$ with nodes $u'$ and $v'$ of $T_2$ 
respectively if and only if $u \leadsto v \Leftrightarrow u' \leadsto v'$, where $a \leadsto b$ indicates that $a$ is an 
ancestor of $b$. 
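
The hierarchical constraint on a candidate set of node merges can be checked directly. The sketch below assumes each tree model is stored as a hypothetical parent map and that a candidate merge is a list of node pairs; the names `is_ancestor` and `merge_is_consistent` are ours. It only tests admissibility and does not search for the merge that minimises the descriptor length.

```python
def is_ancestor(parent, a, b):
    """True if a is a proper ancestor of b in the tree given by a parent map."""
    node = parent[b]
    while node is not None:
        if node == a:
            return True
        node = parent[node]
    return False

def merge_is_consistent(parent1, parent2, pairs):
    """Check that paired nodes satisfy the same ancestor relation in both trees."""
    for u, u2 in pairs:
        for v, v2 in pairs:
            if is_ancestor(parent1, u, v) != is_ancestor(parent2, u2, v2):
                return False
    return True

# Toy trees (hypothetical): a chain r > a > b and a chain R > B > A.
tree1 = {"r": None, "a": "r", "b": "a"}
tree2 = {"R": None, "B": "R", "A": "B"}

ok = merge_is_consistent(tree1, tree2, [("r", "R"), ("a", "A")])
clash = merge_is_consistent(tree1, tree2, [("a", "A"), ("b", "B")])
```

In the toy example the second candidate is rejected because `a` is an ancestor of `b` in the first tree while `A` is a descendant of `B` in the second.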

The descriptor length advantage obtained by merging the nodes v and v' is: 

\[
A(v, v') = LL_v(V \mid T_c, C) + LL_{v'}(V \mid T_c, C) - LL_{(vv')}(V \mid T_c, C) + l. \qquad (6)
\]

The set of merges $M$ that minimizes the descriptor length of the combined 
tree-union also maximizes the advantage function 

\[
A(M) = \sum_{(v, v') \in M} A(v, v').
\]

For each pair of initial mixture components we calculate the union and the 
descriptor length of the merged structure. From the set of potential merges, we 
can identify the one which is both allowable and which reduces the descriptor 
cost by the greatest amount. The mixing proportion for this optimal merge 
is equal to the sum of the proportions of the individual unions. At this point 
we calculate the union and descriptor cost that results from merging the newly 
obtained model with each of the remaining components. We iterate the algorithm 
until no more merges can be found that reduce the descriptor length. 

5 Pattern Spaces from Union Trees 

We can use the union-trees to embed the shapes of the same class in a pattern 
space using principal components analysis. To do this we place the nodes of the 
union tree $T_c$ in an arbitrary order. To each sample tree $t$ we associate a 
pattern-vector $x_t = (x_1, \ldots, x_n)^T \in \mathbb{R}^n$, where $n = |\mathcal{N}_c|$ is the number of nodes in the 
tree model $T_c$. Here $x_t(i) = w_i^t$ if the tree has a node mapped to the $i$-th node of 
the union tree and is zero otherwise. For each union-tree $T_c$ we compute the mean 
pattern-vector $\hat{x}_c = \frac{1}{|V_c|} \sum_{t \in V_c} x_t$ and covariance matrix 
$\Sigma_c = \frac{1}{|V_c|} \sum_{t \in V_c} (x_t - \hat{x}_c)(x_t - \hat{x}_c)^T$, where $V_c$ is the set of sample trees merged 
to form the tree union $T_c$. Suppose that the eigenvectors (ordered to decreasing 
eigenvalue) are $e_1, e_2, \ldots, e_{|\mathcal{N}_c|}$. The leading $l_{sig}$ eigenvectors are used to form 
the columns of the matrix $E = (e_1 | e_2 | \cdots | e_{l_{sig}})$. We perform PCA on the 
sample-trees by projecting the pattern-vectors onto the leading eigenvectors of 
the covariance matrix. The projection of the pattern-vector for the sample tree 
indexed $t$ is $y_t = E^T x_t$. The distance between the vectors in this space is 
$D_{PCA}(t, t') = (y_t - y_{t'})^T (y_t - y_{t'})$. 

Fig. 1. Clusters extracted with a purely-structural mixture of trees approach versus 
pairwise clustering of attributed distances obtained with edit distance and tree union. 
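
The embedding step can be sketched with NumPy as follows. This is an illustration under assumptions: the pattern vectors are supplied as the rows of a matrix, the covariance is normalised by the number of sample trees, and the names `union_embedding` and `pca_distance` as well as the toy matrix are hypothetical.

```python
import numpy as np

def union_embedding(X, n_components):
    """Project pattern vectors (rows of X) onto the leading eigenvectors.

    Returns the projected vectors Y (one row per tree) and the matrix E
    whose columns are the leading eigenvectors of the covariance matrix.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = centered.T @ centered / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]  # leading ones first
    E = eigvecs[:, order]
    Y = X @ E                                          # row t holds E^T x_t
    return Y, E

def pca_distance(y_a, y_b):
    d = y_a - y_b
    return float(d @ d)

# Toy pattern vectors: three trees over a union with three nodes, where only
# the weight of the first node varies.
X_toy = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
Y_toy, E_toy = union_embedding(X_toy, n_components=1)
d_toy = pca_distance(Y_toy[0], Y_toy[2])
```

`np.linalg.eigh` is used because the covariance matrix is symmetric. The sign of each eigenvector is arbitrary, but the distance between projected vectors does not depend on it.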

6 Experimental Results 

We illustrate the utility of the tree-clustering algorithm on sets of shock trees. 
The shock tree is a graph-based representation of the differential structure of the 
boundary of a 2D shape. We augment the skeleton topology with a measure of 
feature importance based on the rate of change of boundary length with distance 
along the skeleton. 



6.1 Clustering Examples 

To illustrate the clustering process, we commence with a study on a small 
database of 25 shapes. In order to assess the quality of the method, we 
compare the clusters defined by the components of the mixture with those obtained 
by applying a graph spectral pairwise clustering method recently developed by 
Robles-Kelly and Hancock [14] to the distances between graphs. This method 
locates the clusters by iteratively extracting the eigenvectors from the matrix 
of edit-distances between the graphs. The edit distances are computed in two 
alternative ways. First, we compute weighted edit distance using the method 
outlined in [17]. The second method involves computing the distance matrix 
using the projected vectors by embedding the trees in a single tree union [18]. 
These two distance measures are enhanced with geometrical information linked 
to the nodes of the trees in the form of a node weight. The weight of each node 
is equal to the proportion of the boundary length that generated the skeletal 
branch associated to the node. 

Figure 1 shows the clusters extracted from the database of 25 shapes. The 
first column shows the clusters extracted through the mixture of tree unions 
approach, and relies on a purely structural representation of shape. The second 
column displays the clusters extracted from the weighted edit-distances between 
shock-trees; here the structural information is enhanced with geometrical 
information. The third column shows the clusters extracted from the distances 
obtained by embedding the geometrically-enhanced shock-trees in a single 
tree-union. While there is some merge and leakage, the clusters extracted with the 
mixture of tree unions compare favorably with those obtained using the 
alternative clustering algorithms, even though these are based on data enhanced 
with geometrical information. The second to last cluster extracted using the 
mixture of tree unions deserves some further explanation. The structure of the 
shock-trees of the distinct tools in the cluster is identical. Hence, by using 
only structural information, the method clusters the shock-trees together. To 
distinguish between the objects, geometrical information must be provided too. 
Hence, the two alternative clustering methods are able to distinguish between 
the wrenches, brushes and pliers. 

Fig. 2. Left: 2D multi-dimensional scaling of the pairwise distances of the shock graphs. 
(The numbers correspond to the shape classes.); Right: Proportion of correct 
classifications obtained with the mixture of trees versus those obtained with pairwise clustering. 

A more challenging experimental vehicle is provided by a larger database of 
120 trees, which is divided into 8 shape classes containing 15 shapes each. To 
perform an initial evaluation of this database, we have applied multidimensional 
scaling to the weighted edit distances between the shock graphs for the different 
shapes. By doing this we embed points representing the graphs in a low dimen- 
sional space spanned by the eigenvectors of a similarity matrix computed from 
the pairwise distances. In Figure 2 we show the projection of the graphs onto 
the 2D space spanned by the leading two eigenvectors of the similarity matrix. 
Each label in the plot corresponds to a particular shape class. Label 1 identifies 
hands, label 2 horses, label 3 ducks, 4 men, 5 pliers, 6 screwdrivers, 7 dogs, and, 
finally, label 8 is associated with leaves. The plot clearly shows the difficulty of 
this clustering problem. The shape groups are not well separated. Rather, there 
is a good deal of overlap between them. Furthermore, there are a considerable 
number of outliers. 

To assess the ability of the clustering algorithm to separate the shape classes, 
we performed experiments on an increasing number of shapes. We commenced 
with the 30 shapes from the first two shape classes, and then increased the 
number of shape classes under consideration until the full set of 120 shapes 
was included. Figure 2 plots the proportion of shapes correctly classified as the 
number of shapes is increased. The solid line plots the result obtained using 
the mixture of weighted tree unions, while the dotted line displays the results 
obtained with pairwise clustering of the weighted edit distances between the 
shapes. The mixture of tree unions clearly outperforms the pairwise clustering 
algorithm. 

Fig. 3. Principal components analysis of the union embedding of the clusters. 

We now turn our attention to the results of applying PCA to the union trees, 
as described in Section 5. Figure 3 displays the first two principal components of 
the sample-tree distribution for the embedding spaces extracted from six shape 
classes. In most cases there appears to be a tightly packed central cluster with 
a few shapes scattered further away than the rest. This separation is linked to 
substantial variations in the structure of the shock trees. For example, in the 
shape-space formed by the class of pliers the outlier is the only pair-of-pliers 
with the nose closed. In the case of shape-space for the horse-class, the outliers 
appear to be the cart-horses while the inliers are the ponies. 

7 Conclusions 

In this paper we have presented an information theoretic framework for cluster- 
ing trees and for learning a generative model of the variation in tree structure. 
The problem is posed as that of learning a mixture of tree unions. We 
demonstrate how the three sets of operations needed to learn the generative model, 
namely node correspondence, tree merging and node probability estimation, can 
each be couched in terms of minimising a description length criterion. We 
provide variants of the algorithm that can be applied to samples of both weighted and 
unweighted trees. The method is illustrated on the problem of learning shape- 
classes from sets of shock trees. 

References 

1. C. Cyr and B. Kimia, 3D Object Recognition Using Shape Similarity-Based Aspect 
Graph, ICCV 2001. 

2. S. J. Dickinson, A. P. Pentland, and A. Rosenfeld, 3-D shape recovery using dis- 
tributed aspect matching, PAMI , Vol. 14(2), pp. 174-198, 1992. 

3. N. Friedman and D. Koller, Being Bayesian about Network Structure, Machine 
Learning , to appear, 2002 

4. L. Getoor et al., Learning Probabilistic models of relational structure, in 8th Int. 
Conf. on Machine Learning , 2001. 

5. D. Heckerman, D. Geiger, and D. M. Chickering, Learning Bayesian networks: the 
combination of knowledge and statistical data, Machine Learning , Vol. 20(3), pp. 
197-243, 1995. 

6. T. Heap and D. Hogg, Wormholes in shape space: tracking through discontinuous 
changes in shape, in ICCV , pp. 344-349, 1998. 

7. G.R. Hjaltason and H. Samet, Properties of embedding methods for similarity 
searching in metric spaces, PAMI(25), pp. 530-549, 2003. 

8. S. Ioffe and D. A. Forsyth, Human Tracking with Mixtures of Trees, ICCV , Vol. I, 
pp. 690-695, 2001. 

9. Y. Keselman, A. Shokoufandeh, M.F. Demirci, and S. Dickinson, Many-to-many 
graph matching via metric embedding, CVPR03(I: 850-857). 

10. B. B. Kimia, A. R. Tannenbaum, and S. W. Zucker, Shapes, shocks, and deforma- 
tions I, International Journal of Computer Vision , Vol. 15, pp. 189-224, 1995. 

11. N. Linial, E. London and Y. Rabinovich, The geometry of graphs and some of 
its applications, 35th Annual Symposium on Foundations of Computer Science, pp. 
169-175, 1994. 

12. M. Meila. Learning with Mixtures of Trees. PhD thesis, MIT, 1999. 

13. J. Rissanen, Stochastic complexity and modeling, Annals of Statistics , Vol. 14, pp. 
1080-1100, 1986. 

14. A. Robles-Kelly and E. R. Hancock. A maximum likelihood framework for iterative 
eigendecomposition. In ICCV , Vol. I, pp. 654-661, 2001. 

15. A. Shokoufandeh, S. J. Dickinson, K. Siddiqi, and S. W. Zucker, Indexing using a 
spectral encoding of topological structure, in CVPR , 1999. 

16. T. Sebastian, P. Klein, and B. Kimia, Recognition of shapes by editing shock 
graphs, in ICCV , Vol. I, pp. 755-762, 2001. 

17. A. Torsello and E. R. Hancock. Efficiently computing weighted tree edit distance 
using relaxation labeling. In EMMCVPR , pp. 438-453, 2001. 

18. A. Torsello and E. R. Hancock, Matching and embedding through edit-union of 
trees. In ECCV, pp. 822-836, 2002. 

19. S. C. Zhu and A. L. Yuille, FORMS: A Flexible Object Recognition and Modelling 
System, IJCV , Vol. 20(3), pp. 187-212, 1996. 




Decision Theoretic Modeling of 
Human Facial Displays 



Jesse Hoey and James J. Little 

Department of Computer Science, University of British Columbia 
2366 Main Mall, Vancouver, BC, CANADA V6T 1Z4 

{jhoey,little}@cs.ubc.ca 



Abstract. We present a vision based, adaptive, decision theoretic model 
of human facial displays in interactions. The model is a partially observ- 
able Markov decision process, or POMDP. A POMDP is a stochastic 
planner used by an agent to relate its actions and utility function to 
its observations and to other context. Video observations are integrated 
into the POMDP using a dynamic Bayesian network that creates spa- 
tial and temporal abstractions of the input sequences. The parameters 
of the model are learned from training data using an a-posteriori con- 
strained optimization technique based on the expectation-maximization 
algorithm. The training does not require facial display labels on the 
training data. The learning process discovers clusters of facial display se- 
quences and their relationship to the context automatically. This avoids 
the need for human intervention in training data collection, and allows 
the models to be used without modification for facial display learning in 
any context without prior knowledge of the type of behaviors to be used. 
We present an experimental paradigm in which we record two humans 
playing a game, and learn the POMDP model of their behaviours. The 
learned model correctly predicts human actions during a simple cooper- 
ative card game based, in part, on their facial displays. 



1 Introduction 

There has been a growing body of work in the past decade on the communica- 
tive function of the face [1]. This psychological research has drawn three major 
conclusions. First, facial displays are often purposeful communicative signals. 
Second, the purpose is not defined by the display alone, but is dependent on 
both the display and the context in which the display was emitted. Third, the 
signals are not universal, but vary widely between individuals in their physical 
appearance, their contextual relationships, and their purpose. We believe that 
these three considerations should be used as critical constraints in the design 
of communicative agents able to learn, recognise, and use human facial signals. 
They imply that a rational communicative agent must learn the relationships 
between facial displays, the context in which they are shown, and its own 
utility function: it must be able to compute the utility of taking actions in 
situations involving purposeful facial displays. The agent will then be able to make 
value-directed decisions based, in part, upon the “meaning” of facial displays 
as contained in these learned connections between displays, context, and utility. 
The agent must also be able to adapt to new interactants and new situations, 
by learning new relationships between facial displays and other context. 

This paper presents a vision-based, adaptive, Bayesian model of human facial 
displays. The model is, in fact, a partially observable Markov decision process, 
or POMDP [2], with spatially and temporally abstract, continuous observations 
over the space of video sequences. The POMDP model integrates the recogni- 
tion of facial signals with their interpretation and use in a utility-maximization 
framework. This is in contrast to other approaches, such as hidden Markov mod- 
els, which consider that the goal is simply to categorize a facial display. POMDPs 
allow an agent to make decisions based upon facial displays, and, in doing so, 
define facial displays by their use in decision-making. Thus, the POMDP train- 
ing is freed from the curse of labeling training data which expresses the bias of 
the labeler, not necessarily the structure of the task. The model can be acquired 
from data, such that an agent can learn to act based on the facial signals of a 
human through observation. To ease the burden on decision-making, the model 
builds temporal and spatial abstractions of input video data. For example, one 
such abstraction may correspond with the wink of an eye, whereas another may 
correspond to a smile. These abstractions are also learned from data, and allow 
decision making to occur over a small set of states which are accurate temporal 
and spatial summarizations of the continuous sensory signals. 

Our work is distinguished from other work on recognising facial communi- 
cations primarily because the facial displays are not defined prior to learning 
the model. We do not train classifiers for different facial motions and then base 
decisions upon the classifier outputs. Instead, the training process discovers cat- 
egories of facial displays in the data and their relationships with context. The 
advantage of learning without pre-defined labels is threefold. First, we do not 
need labeled training data, nor expert knowledge about which facial motions are 
important. Second, since the system learns categories of motions, it will adapt to 
novel displays without modification. Third, resources can be focused on useful 
tasks for the agent. It is wasteful to train complex classifiers for the recognition 
of fine facial motion if only simple displays are being used in the agent’s context. 

The POMDPs we learn have observations which are video sequences, modeled 
with mixtures of coupled hidden Markov models (CHMMs) [3]. The CHMM is 
used to couple the images and their derivatives, as described in Section 3.1. 
While it is usual in a hierarchical model to commit to a most likely value at a 
certain level [4,5], our models propagate noisy evidence from video at the lowest 
level to actions at the highest, and the choice of actions can be probabilistically 
based upon all available evidence. 

2 Previous Work 

There are many examples of work in computer vision analysing facial displays [6],
and human motion in general [7,4]. However, this work is usually supervised, in
that models of particular classes of human motion are learned from labeled
training data. There has been some recent research in unsupervised learning of 
motion models [8,5], but few have attempted to explicitly include the modeling 
of actions and utility, and none have looked at facial displays. Action-Reaction
Learning [9] is a system for analysing and synthesising human behaviours. It is
primarily reactive, however, and does not learn models conducive to high-level
reasoning about the long term effects of actions. 

Our previous work on this topic has led to the development of many parts 
of the system described in this paper. In particular, the low-level computer vi- 
sion system for instantaneous action recognition was described in [10], while the 
simultaneous learning of the high-level parameters was explored in [11]. This 
paper combines this previous work, explicitly incorporates actions and utilities, 
and demonstrates how the model is a POMDP, from which policies of action can 
be extracted. Complete details can be found in [12]. 

POMDPs have become the semantic model of choice for decision theoretic 
planning in the artificial intelligence (AI) community. While solving POMDPs 
optimally is intractable for most real-world problems, the use of approximation
methods has recently enabled their application to substantial planning problems
involving uncertainty, for example, card games [13] and robot control [14].
POMDPs were applied to the problem of active gesture recognition in [15], in 
which the goal is to model unobservable, non-foveated regions. This work models 
some of the basic mechanics underlying dialogue, such as turn taking, channel 
control, and signal detection. Work creating embodied agents has led to much 
progress in creating agents that interact using verbal and non-verbal communi- 
cation [16]. These agents typically only use a small subset of manually specified 
facial expressions or gestures. They focus instead primarily on dialogue manage- 
ment and multi-modal inputs, and have not used POMDPs. 

3 POMDPs for Facial Display Understanding 

A POMDP is a probabilistic temporal model of an agent interacting with the en- 
vironment [2], shown as a Bayesian network in Figure 1(a). A POMDP is similar 
to a hidden Markov model in that it describes observations as arising from hid- 
den states, which are linked through a Markovian chain. However, the POMDP 
adds actions and rewards, allowing for decision theoretic planning. A POMDP 
is a tuple (S, A, T, R, O, B), where S is a finite set of (possibly unobservable)
states of the environment, A is a finite set of agent actions, T : S × A → S is a
transition function which describes the effects of agent actions upon the world
states, O is a set of observations, B : S × A → O is an observation function which
gives the probability of observations in each state-action pair, and R : S → ℝ is
a real-valued reward function, associating with each state s its immediate utility
R(s). A POMDP model allows an agent to predict the long term effects of its
actions upon its environment, and to choose actions based on these predictions.
Factored POMDPs [18] represent the state, S, using a set of variables, such that
the state space is the product of the spaces of each variable. Factored POMDPs
allow conditional independencies in the transition function, T, to be leveraged.
Further, T is written as a set of smaller, more intuitive functions.

Fig. 1. (a) Two time slices of general POMDP. (b) Two time slices of factored POMDP
for facial display understanding. The state, S, has been factored into {Bs, Aact, Acom},
and conditional independencies have been introduced: Ann’s actions do not depend on
her previous actions and Ann’s display is independent of her previous action given the
state and her previous display. These independencies are not strictly necessary, but
simplify our discussion, and are applicable in the simple game we analyse.
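
To make the roles of T and B in the tuple concrete, the following is a minimal sketch of one belief update for a small discrete POMDP. It assumes finite state and observation sets and treats T as a distribution P(s' | s, a); the function name `belief_update`, the nested-dictionary layout, and the toy numbers are illustrative choices and are not taken from the paper.

```python
def belief_update(belief, action, observation, T, B):
    """One Bayes-filter step for a discrete POMDP (illustrative sketch).

    belief: dict state -> probability, the current belief b(s)
    T[s][a][s2]: assumed layout for the transition probability P(s2 | s, a)
    B[s2][a][o]: assumed layout for the observation probability P(o | s2, a)
    """
    states = list(belief)
    unnormalised = {}
    for s_next in states:
        # Predict: push the current belief through the transition function.
        predicted = sum(T[s][action][s_next] * belief[s] for s in states)
        # Correct: weight the prediction by the likelihood of the observation.
        unnormalised[s_next] = B[s_next][action][observation] * predicted
    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in unnormalised.items()}


# Hypothetical two-state example with a single action "a".
T = {"s0": {"a": {"s0": 0.9, "s1": 0.1}},
     "s1": {"a": {"s0": 0.2, "s1": 0.8}}}
B = {"s0": {"a": {"x": 0.7, "y": 0.3}},
     "s1": {"a": {"x": 0.1, "y": 0.9}}}
new_belief = belief_update({"s0": 0.5, "s1": 0.5}, "a", "x", T, B)
```

The returned dictionary is again a probability distribution over states, so the update can be applied repeatedly as actions are taken and observations arrive.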

Purposeful facial display understanding implies a multi-agent setting, such
that each agent will need to model all other agents’ decision strategies as part of
its internal state¹. In the following, we will refer to the two agents we are modeling
as “Bob” and “Ann”, and we will discuss the model from Bob’s perspective.
Figure 1(b) shows a factored POMDP model for facial display understanding in
simple interactions. The state of Bob’s POMDP is factored into Bob’s private
internal state, Bs, Ann’s action, Aact, and Ann’s facial display, Acom, such
that S_t = {Bs_t, Aact_t, Acom_t}. While Bs and Aact are observable, Acom is not,
and must be inferred from video sequence observations, O. We wish to focus 
on learning models of facial displays, Acom, and so we will use games in which 
Aact and Bs are fully observable, which they are not in general. For example, 
in a real game of cards, a player must model the suit of any played card as an 
unobservable variable, which must be inferred from observations of the card. In 
our case, games will be played through a computer interface, and so these kinds 
of actions are fully observable. 

The transition function is factored into four terms. The first involves only
fully observable variables, and is the conditional probability of the state at time
t under the effect of both players’ actions: Θ_S = P(Bs_t | Aact_t, Bact, Bs_{t-1}).

¹ This is known as the decision analytic approach to games, in which each agent
decides upon a strategy based upon his subjective probability distribution over the 
strategies employed by other players. 
The second is over Ann’s actions given Bob’s action, the previous state, and her
previous display: Θ_A = P(Aact_t | Bact, Acom_{t-1}, Bs_{t-1}). The third describes
Bob’s expectation about Ann’s displays given his action, the previous state
and her previous display: Θ_D = P(Acom_t | Bact, Bs_{t-1}, Acom_{t-1}). The fourth
describes what Bob expects to see in the video of Ann’s face, O, given his
high-level descriptor, Acom: Θ_O = P(O_t | Acom_t). For example, for some state
of Acom, this function may assign high likelihood to sequences in which Ann
smiles. This value of Acom is only assigned meaning through its relationship
with the context and Bob’s action and utility function. We can, however, look at
this observation function, and interpret it as an Acom = ’smile’ state. Writing
C_t = {Bact_t, Bs_{t-1}}, A_t = Aact_t, and D_t = Acom_t, the likelihood of a sequence
of data, {OCA} = {O_1 ... O_T, C_1 ... C_T, A_1 ... A_T}, is

$$P(\{OCA\}_{1,T} \mid \Theta) = \sum_k P(O_T \mid D_{T,k}) \sum_l \Theta_A \, \Theta_D \, P(D_{T-1,l}, \{OCA\}_{1,T-1} \mid \Theta) \qquad (1)$$

where D_{t,k} is the k-th value of the mixture state, D, at time t. The observations,
O, are temporal sequences of finite extent. We assume that the boundaries of 
these temporal sequences will be given by the changes in the fully observable 
context state, C and A. There are many approaches to this problem, ranging 
from the complete Bayesian solution in which the temporal segmentation is 
parametrised and integrated out, to specification of a fixed segmentation time [4]. 
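
The recursion in Equation 1 is a forward pass over the display state D. The sketch below illustrates it under stated assumptions: the per-segment observation likelihoods P(O_t | D_{t,k}) are supplied as numbers, actions and contexts are integer-coded, and an initial distribution over D is given. The function name `sequence_likelihood`, the array layouts, and the `prior` argument are hypothetical and not part of the paper.

```python
def sequence_likelihood(obs_lik, actions, contexts, theta_A, theta_D, prior):
    """Forward recursion for P({OCA}_{1,T} | Theta) in the spirit of Equation 1.

    obs_lik[t][k]    : P(O_t | D_t = k), assumed to be precomputed
    actions[t]       : index of the observed action A_t
    contexts[t]      : index of the observed context C_t
    theta_A[a][k][c] : assumed layout for P(A_t = a | D_t = k, C_t = c)
    theta_D[k][l][c] : assumed layout for P(D_t = k | D_{t-1} = l, C_t = c)
    prior[k]         : assumed initial distribution over D
    """
    n_states = len(prior)
    # Initialise with the first segment.
    alpha = [obs_lik[0][k] * theta_A[actions[0]][k][contexts[0]] * prior[k]
             for k in range(n_states)]
    for t in range(1, len(obs_lik)):
        a, c = actions[t], contexts[t]
        # Inner sum over the previous display state l, outer term for state k.
        alpha = [
            obs_lik[t][k] * theta_A[a][k][c]
            * sum(theta_D[k][l][c] * alpha[l] for l in range(n_states))
            for k in range(n_states)
        ]
    # Summing over the final display state gives the sequence likelihood.
    return sum(alpha)
```

For long sequences the products underflow, so a practical version would rescale `alpha` at each step or work with log probabilities.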

3.1 Observations 

We now must compute P(O | Acom), where O is a sequence of video frames.
We have developed a method for generating temporally and spatially abstract 
descriptions of sequences of facial displays from video [10,12]. We give a brief 
outline of the method here. Figure 2 shows the model as a Bayesian network 
being used to assess a sequence in which a person smiles. 

We consider that spatially abstracting a video frame during a human facial 
display involves modeling both the current configuration and dynamics of the 
face. Our observations consist of the video images, I, and the temporal derivatives,
f_t, between pairs of images. The task is first to spatially summarise both
of these quantities, and then to temporally compress the entire sequence to a 
distribution over high level descriptors, Acom. We assume that the face region 
is tracked through the sequence by a separate tracking process, such that the 
observations arise from the facial region in the images only. We use a flow-based 
tracker, described in more detail in [12]. 

The spatial abstraction of the derivative fields involves a projection of the 
associated optical flow field, v, over the facial region to a set of pre-determined 
basis functions. The basis functions are a complete and orthogonal set of 2D 
polynomials which are effective for describing flow fields [12]. The resulting feature
vector, Z_x, is then conditioned on a set of discrete states, X, parametrised
by normal distributions. The projection is accomplished by analytically integrating
the observation likelihood, P(f_t | X), over the space of optical flow fields and
over the feature vector space. This method ensures that all observation noise
is propagated to the high level [10]. The abstraction of the images also uses
projections of the raw (grayscale) images to the same set of basis functions, resulting
in a feature vector, Z_w, which is also modeled using a mixture of normal
distributions with mixture coefficients W.

Fig. 2. A person smiling is analysed by the mixture of CHMMs. Observations, O, are
sequences of images, I, and image temporal derivatives, f_t, both of which are projected
over the facial region to a set of basis functions, yielding feature vectors, Z_x and Z_w.
The image regions, H, are projected directly, while it is actually the optical flow fields,
V, related to the image derivatives which are projected to the basis functions [10].
Z_x and Z_w are both modeled using mixtures of Gaussians, X and W, respectively.
The class distributions, X and W, are temporally modeled as a mixture, D, of coupled
Markov chains. The probability distribution over D is at the top. The most likely state,
D = 1, can be associated with the concept “smile”. Probability distributions over X
and W are shown for each time step. All other nodes in the network show their expected
value given all evidence. Thus, the flow field, v, is actually ⟨v⟩ = ∫ v P(v | O) dv.

The basis functions are a complete and orthogonal set, but only a small 
number may be necessary for modeling any particular motion. We use a feature 
weighting technique that places priors on the normal means and covariances, so 
that choosing a set of basis functions is handled automatically by the model [10]. 

At each time frame, we have a discrete dynamics state, X, and a discrete 
configuration state, W, which are abstract descriptions of the instantaneous 
dynamics and configuration of the face, respectively. These are temporally abstracted
using a mixture of coupled hidden Markov models (CHMM), in which
the dynamics and configuration states are interacting Markovian processes. The
conditional dependencies between the X and W chains are chosen to reflect the
relationship between the dynamics and configuration. This mixture model can
be used to compute the likelihood of a video sequence given the facial display
descriptor, P(O | Acom):

$$P(\{O\}_{1,T} \mid D_t) = \sum_{ij} P(f_t \mid X_{T,i}) \, P(I_t \mid W_{T,j}) \sum_{kl} \Theta_{Xijk} \, \Theta_{Wjkl} \, P(X_{T-1,k}, W_{T-1,l}, \{O\}_{1,T-1} \mid D_t) \qquad (2)$$

where Θ_X, Θ_W are the transition matrices in the coupled X and W chains,
and P(f_t | X_{T,i}), P(I_t | W_{T,j}) are the associated observation functions [12]. The
mixture components, D, are a set of discrete abstractions of facial behavior. It 
is important to remember that there are no labels associated with these states 
at any time during the training. Labels can be assigned after training, as is done 
in Figure 2, but these are only to ease exposition. 



3.2 Learning POMDPs 

We use the expectation-maximization (EM) algorithm [17] to learn the parame- 
ters of the POMDP. It is important to stress that the learning takes place over 
the entire model simultaneously: both the output distributions, including the 
mixtures of coupled HMMs, and the high-level POMDP transition functions are 
all learned from data during the process. The learning classifies the input video 
sequences into a spatially and temporally abstract finite set, Acom , and learns 
the relationship between these high-level descriptors, the observable context, and 
the action. We only present some salient results of the derivation here. We seek
the set of parameters, Θ*, which maximize

$$\Theta^* = \arg\max_{\Theta} \left[ \sum_D P(D \mid O, C, A, \Theta') \log P(D, O, C, A \mid \Theta) + \log P(\Theta) \right] \qquad (3)$$

subject to constraints on the parameters, Θ*, that they describe probability
distributions (they sum to 1). The “E” step of the EM algorithm is to compute
the expectation over the hidden state, P(D | O, C, A, Θ'), given Θ', a current guess
of the parameter values. The “M” step is then to perform the maximization
which, in this case, can be computed analytically by taking derivatives with
respect to each parameter, setting to zero and solving for the parameter.

The update for the D transition parameter, Θ_{Dijk} = P(D_{t,i} | D_{t-1,j}, C_{t,k}), is
then

$$\Theta_{Dijk} = \frac{\alpha_{Dijk} + \sum_{t \in \{1 \ldots N_t\} \mid C_t = k} P(D_{t,i}, D_{t-1,j} \mid O, A, C, \Theta')}{\sum_i \left[ \alpha_{Dijk} + \sum_{t \in \{1 \ldots N_t\} \mid C_t = k} P(D_{t,i}, D_{t-1,j} \mid O, A, C, \Theta') \right]}$$

where the sum over the temporal sequence is only over time steps in which C_t = k,
and α_{Dijk} is the parameter of the Dirichlet smoothing prior. The summand
can be factored as

$$P(D_{t,i}, D_{t-1,j} \mid O, A, C, \Theta') = \beta_{t,i} \, \Theta_{A*i*} \, P(O_t \mid D_{t,i}) \, \Theta_{Dijk} \, \alpha_{t-1,j}$$

where α_{t,j} = P(D_{t,j}, {OAC}_{1,t}) and β_{t,i} = P({OAC}_{t+1,T} | D_{t,i}) are the usual forwards
and backwards variables, for which we can derive recursive updates

$$\alpha_{t,j} = \sum_k P(O_t \mid D_{t,j}) \, \Theta_{A*j*} \, \Theta_{Djk*} \, \alpha_{t-1,k} \qquad \beta_{t-1,i} = \sum_k \beta_{t,k} \, \Theta_{A*k*} \, P(O_t \mid D_{t,k}) \, \Theta_{Dki*}$$

where we write Θ_{A*j*} = P(A_t = * | D_{t,j}, C_t = *) and P(O_t | D_{t,j}) is the likelihood
of the data given a state of the mixture of CHMMs (Equation 2). The updates to
Θ_{Aijk} = P(A_{t,i} | D_{t,j}, C_{t,k}) are Θ_{Aijk} = Σ_{t ∈ {1...N_t} | A_t = i ∧ C_t = k} P(D_{t,j} | OAC), where
P(D_{t,j} | OAC) = β_{t,j} α_{t,j}. The updates to the j-th component of the mixture of
CHMMs are weighted by P(D_{t,j} | OAC), but are otherwise the same as for a normal CHMM [3].
The complete derivation, along with the updates to the output distributions of 
the CHMMs, including to the feature weights, can be found in [12]. 
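
As an illustration of the M-step for the display transition parameter, the sketch below applies the Dirichlet-smoothed update to pairwise posteriors that are assumed to have been computed already in the E-step. The function name `update_transitions`, the argument names, and the list layouts are hypothetical; the Dirichlet parameters are treated as positive pseudo-counts, which is an assumption of this sketch.

```python
def update_transitions(pair_posteriors, contexts, dirichlet, n_states, n_contexts):
    """Dirichlet-smoothed M-step for theta_D[i][j][k] = P(D_t=i | D_{t-1}=j, C_t=k).

    pair_posteriors[t][i][j] : P(D_t = i, D_{t-1} = j | O, A, C, Theta'),
                               assumed to come from the forwards-backwards pass
    contexts[t]              : index k of the observed context at step t
    dirichlet[i][j][k]       : smoothing parameter, assumed positive
    """
    # Start the expected counts at the Dirichlet parameters.
    counts = [[[dirichlet[i][j][k] for k in range(n_contexts)]
               for j in range(n_states)] for i in range(n_states)]
    # Add each posterior only to the context that was active at that step.
    for xi, k in zip(pair_posteriors, contexts):
        for i in range(n_states):
            for j in range(n_states):
                counts[i][j][k] += xi[i][j]
    # Normalise over the new state i for every (previous state, context) pair.
    theta = [[[0.0] * n_contexts for _ in range(n_states)]
             for _ in range(n_states)]
    for j in range(n_states):
        for k in range(n_contexts):
            norm = sum(counts[i][j][k] for i in range(n_states))
            for i in range(n_states):
                theta[i][j][k] = counts[i][j][k] / norm
    return theta
```

With positive smoothing parameters every column is well defined even for contexts that never occur in the training data, in which case the update simply returns the normalised prior.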

3.3 Solving POMDPs 

If observations are drawn from a finite set, then an optimal policy of action can 
be computed for a POMDP [2] using dynamic programming over the space of 
the agent’s belief about the state, b(s). However, if the observation space is con- 
tinuous, as in our case, the problem becomes much more difficult. In fact, there 
are no known algorithms for computing optimal policies for such problems. Nev- 
ertheless, approximation techniques have been developed, and yield satisfactory 
results [14]. Since our focus in this paper is to learn POMDP models, we use the 
simplest possible approximation technique, and simply consider the POMDP as 
a fully observable MDP: the state, S, is assigned its most likely value in the
belief state, S = argmax_s b(s). Dynamic programming updates then consist of
computing value functions, V_n, where V_n(s) gives the expected value of being
in state s with a future of n stages to go (horizon of n), assuming the optimal
actions are taken at each step. The actions that maximize V_n are the n stage-to-go
policy (the policy looking forward to a horizon n stages in the future). We
use the SPUDD solver to compute these policies [18]. 
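
The n stage-to-go computation can be illustrated with plain tabular value iteration on a fully observable MDP. This is a sketch only: it is not the SPUDD solver, which works on a factored representation, and the function name `stage_to_go_policy`, the dictionary layout, the `discount` argument, and the toy states are all assumptions made for the example.

```python
def stage_to_go_policy(states, actions, T, R, horizon, discount=1.0):
    """Finite-horizon value iteration on a fully observable MDP (sketch).

    T[s][a] : dict mapping next state -> probability (assumed layout)
    R[s]    : immediate reward of state s
    Returns the value function V_n and the n stage-to-go policy for n = horizon.
    """
    value = {s: R[s] for s in states}  # V_0: no stages to go
    policy = {}
    for _ in range(horizon):
        new_value = {}
        for s in states:
            best_action, best_q = None, float("-inf")
            for a in actions:
                expected = sum(p * value[s2] for s2, p in T[s][a].items())
                q = R[s] + discount * expected
                if q > best_q:
                    best_action, best_q = a, q
            new_value[s] = best_q
            policy[s] = best_action
        value = new_value
    return value, policy


# Hypothetical two-state example: committing may lead to a rewarded match.
states = ["waiting", "matched"]
actions = ["bid", "commit"]
T = {"waiting": {"bid": {"waiting": 1.0},
                 "commit": {"matched": 0.5, "waiting": 0.5}},
     "matched": {"bid": {"matched": 1.0}, "commit": {"matched": 1.0}}}
R = {"waiting": 0.0, "matched": 10.0}
value, policy = stage_to_go_policy(states, actions, T, R, horizon=1)
```

To use such a policy with the approximation described above, the belief would first be collapsed to its most likely state and the policy entry for that state consulted.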

4 Experiments 

In order to study the relationships between display recognition and action we 
constrain the structure of an interaction between two humans using rules in a 
computer game. We then observe the humans playing the game and learn models 
of the relationships between their facial motions and the states and actions in the 
game. Subsequent analysis of the learned models reveals how the humans were 
using their faces for achieving value in the game. Our learning method allows such 
games to be analysed without any prior knowledge about what facial displays 
will be used during game play. The model automatically “discovers” what display 
classes are present. We can also compute policies of action from the models. In 
the following, we describe our experiments with a simple card game. Results on 
two other simple games, along with further details on the game here described, 
can be found in [12]. 

Fig. 3. Bob’s game interfaces during a typical round, shown for stage 1 and stage 2.
His cards are face up below the “table”, while Ann’s cards are above it. The current
bid is shown below Bob’s cards, and the winnings are shown along the bottom. The
cards along the sides belong to another team, which is introduced only for motivation.
A bid of hearts in stage 1 is accepted by Ann, and both players commit their heart
in stage 2.



4.1 Card Matching Game 

Two players play the card matching game. At the start of a round, each player 
is dealt three cards from a deck with three suits (♥, ♦, ♣), with values from 1
to 10. Each player can only see his own set of cards. The players play a single 
card simultaneously, and both players win the sum of the cards if the suits 
match. Otherwise, they win nothing. On alternate rounds ( bidding rounds ), a 
player has an opportunity to send a confidential bid to his partner, indicating a 
card suit. The bids are non-binding and do not directly affect the payoffs in the 
game. During the other rounds ( displaying rounds ), the player can only see her 
partner’s bids, and then play one of her cards. There is no time limit for playing 
a card, but the decision to play a card is final once made. Finally, each player 
can see (but not hear) their teammate through a real-time video link. There 
are no game rules concerning the video link, so there are no restrictions placed 
on communication strategies the players can use. The card matching game was 
played by two students in our laboratory, “Bob” and “Ann”, through a computer
interface. A picture of Bob’s game interface during a typical interaction is shown
in Figure 3. Each player viewed their partner through a direct link from their
workstation to a Sony EVI S-video camera mounted above their partner’s screen.
The average frame rate at 320 × 240 resolution was over 28 fps. The rules of the
game were explained to the subjects, and they played four games of five rounds 
each. The players had no chance to discuss potential strategies before the game, 
but were given time to practice. 

We will use data from Bob’s bidding rounds in the first three games to 
train the POMDP model. Observations are three or four variable length video 
sequences for each round, and the actions and the values of the cards of both 
players, as shown in Table 1. The learned model’s performance will then be
tested on the data from Bob’s bidding rounds in the last game. It is possible to 
implement a combined POMDP for both bidding and displaying rounds [12]. 

There are nine variables which describe the state of the game when a player
has the bid. The suit of each of the three cards can be one of ♥, ♦, ♣. Bob’s actions,
Bact, can be null (no action), or sending a confidential bid (bid♥, bid♦, bid♣) or
committing a card (cmt♥, cmt♦, cmt♣). Ann’s observed actions, Aact, can be
null, or committing a card. The Acom variable describes Ann’s communication
through the video link. It is one of N_d high-level states, D = d_1 ... d_{N_d}, of
the mixture of CHMMs model described previously. Although these states have
no meaning in isolation, they will obtain meaning through their interactions
with the other variables in the POMDP. The number of states, N_d, must be
manually specified, but can be chosen as large as possible based on the amount
of training data available. The other six, observable, variables in the game are
more functional for the POMDP, including the values of the cards, and whether
a match occurred or not. The reward function is only based upon fully observable 
variables, and is simply the sum of the played card values, if the suits match. 
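
The reward described here is simple enough to state directly. The sketch below is one way to write it, assuming each played card is represented as a (suit, value) pair; the function name `reward` and the string suit labels are illustrative only.

```python
def reward(bob_card, ann_card):
    """Reward for one round: the sum of the played values if the suits match.

    Each card is a (suit, value) tuple, e.g. ("hearts", 3); this representation
    is an assumption of the sketch, not something specified in the text.
    """
    bob_suit, bob_value = bob_card
    ann_suit, ann_value = ann_card
    if bob_suit == ann_suit:
        return bob_value + ann_value
    return 0
```

Because the reward depends only on fully observable variables, it can be evaluated without any inference over the display state.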

4.2 Results 

The model was trained with four display states. We inspected the model after
training, and found that two of the states (d_1, d_3) corresponded to “nodding”
the head, one (d_4) to “shaking” the head, and the last (d_2) to a null display
with little motion. Training with only three clusters merges the two nodding
clusters together. Figures 4 and 5 show example frames and flows from sequences
recognized as d_4 (shake) and as d_1 (nod), respectively. The sequences correspond
to the last two rows in Table 1, in which Ann initially refuses a bid of ♦ from
Bob, then accepts a bid of ♣.

Table 2(a) shows a part of the learned conditional probability distribution
over Ann’s action, Aact, given the current bid and Ann’s display, Acom. We see
that, if the bid is null, we expect Ann to do nothing in response. If the bid is ♥
and Ann’s display (Acom) is one of the “nodding” displays d_1 or d_3, then we
expect Ann to commit her ♥. On the other hand, if Ann’s display is “shaking”,
d_4, then we expect her to do nothing (and wait for another bid from Bob).

The learned conditional probability distribution of Ann’s display, Acom, at
time t, given the previous and current bids, bid_{t-1} and bid_t, carried two important
pieces of information for Bob: First, at the beginning of a round, any bid
is likely to elicit a non-null display d_1, d_3 or d_4. Second, a “nodding” display is
more likely after a “shaking” display if the bid is changed.



4.3 Computing and Using a Policy 

A 3 stage-to-go policy was computed by assuming that the facial display states 
are observable. There are ten possible values for each card, which expands the 
state space and makes it more difficult to learn accurate models from limited 
training data. To reduce this complexity, we approximate these ten values with 

Table 1. Log for the first two bidding rounds of one of the training games. A blank
means the card values were the same as the previous sequence. Ann’s display, Acom,
is the most likely as classified by the final model.

round  frames     Bob’s cards   Ann’s cards   bid   Bob’s act   Ann’s act   Ann’s display
                  ♥   ♦   ♣     ♥   ♦   ♣           Bact        Aact        Acom
1      40-150     3   4   7     2   10  7     -     bid♣        -           d_2
1      151-295                                ♣     cmt♣        cmt♣        d_1
2      725-827    2   5   2     7   3   8     -     bid♦        -           d_4
2      828-976                                ♦     bid♣        -           d_4
2      977-1048                               ♣     cmt♣        cmt♣        d_1


Fig. 4. Frames from the second-to-last row in Table 1. This sequence occurred after
Bob had bid ♦, and was recognized as Acom = d_4: a head shake. The bottom row shows
the original images, I, with tracked face region, and the temporal derivative fields, f_t.
The middle row shows the expected configuration, H, and flow field, V (scaled by a
factor of 4.0 for visibility). The top row shows distributions over W and X.




Fig. 5. Frames from the last row in Table 1. This sequence occurred after Bob had
made his second bid of ♣ after Ann’s negative response to his first bid, and was
recognized as Acom = d_1: a nod. See Figure 4 for more details.

Table 2. (a) Selected parts of the learned conditional probability distribution over
Ann’s action, Aact, given the current bid and Ann’s display, Acom. Even distributions
are because of lack of training data. (b) Selected parts of policy of action in the card
matching game for the situation in which B♥v = v3, B♦v = v3 and B♣v = v1.

(a)
bid    Acom          Aact: null   cmt♥   cmt♦   cmt♣
null                 0.40         0.20   0.20   0.20
♥      d_1, d_3      0.20         0.40   0.20   0.20
♥      d_2           0.25         0.25   0.25   0.25
♥      d_4           0.40         0.20   0.20   0.20

(b)
bid    Acom             policy Bact
null   d_1              bid♥
null   d_2, d_3         bid♦
null   d_4              cmt♥
♥      d_1, d_2, d_3    cmt♥
♥      d_4              bid♦


three values, v1, v2, v3, where cards valued 1-4 are labeled v1, 5-7 are labeled v2 and
8-10 are labeled v3. More training data would obviate the need for this approximation.
We then classified the test data with the Viterbi algorithm given the
trained model to obtain a fully observable state vector for each time step in the 
game. The computed policy was consulted, and the recommended actions were 
compared to Bob’s actual actions taken in the game. The model correctly pre- 
dicted 6/7 actions in the testing data, and 19/20 in the training data. The error 
in the testing data was due to the subject glancing at something to the side of 
the screen, leading to a classification as d_4. This error demonstrates the need for
dialogue management, such as monitoring of the subject’s attention [14]. 
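
The value abstraction and the comparison against Bob’s actual actions can be sketched as follows. The bin boundaries follow the text (1-4, 5-7, 8-10); the function names `value_bin` and `agreement` and the example action labels are hypothetical.

```python
def value_bin(card_value):
    """Map a card value in 1..10 to one of the three abstract labels."""
    if not 1 <= card_value <= 10:
        raise ValueError("card values range from 1 to 10")
    if card_value <= 4:
        return "v1"
    if card_value <= 7:
        return "v2"
    return "v3"


def agreement(recommended, actual):
    """Count how many recommended actions equal the actions actually taken."""
    if len(recommended) != len(actual):
        raise ValueError("both sequences must cover the same time steps")
    matches = sum(r == a for r, a in zip(recommended, actual))
    return matches, len(actual)


# Hypothetical usage: compare policy recommendations with logged actions.
matches, total = agreement(["bid_hearts", "cmt_hearts"],
                           ["bid_hearts", "bid_diamonds"])
```

Reporting the pair (matches, total) mirrors the way the results are stated in the text, as a count of correct predictions out of the number of actions.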

Table 2(b) shows a part of the policy of action if the player’s cards have
values B♥v = v3, B♦v = v3 and B♣v = v1. For example, if there is no bid on
the table, then Bob should bid one of the high cards: hearts or diamonds. If the
bid is hearts and Ann nodded or did nothing (d_1, d_2 or d_3), then Bob should
commit his heart. If Ann shook her head, though, Bob should bid the diamond.

Notice that, in Table 2(b), the policy is the same for Acom = d_2, d_3. These
states hold similar value for the agent, and could be combined since their distinction
is not important for decision making. It is believed that this type of
learning, in which the state space is reduced for optimal decision making, will
lead to solution techniques for very large POMDPs in the near future [12].

More complex games typically necessitate longer term memory than the 
Markov assumption we have used. However, POMDPs can accommodate longer
dependencies by explicitly representing them in the state space. Further, current 
research in logical reasoning in first-order POMDPs will extend these models to 
be able to deal with more complex high-level situations. 

5 Conclusion 

We have presented an adaptive dynamic Bayesian model of human facial displays 
in interactions. The model is a partially observable Markov decision process, or 
POMDP. The model is trained directly on a set of video sequences, and does 
not need any prior knowledge about the expected types of displays. Without 
any behavior labels, the model discovers classes of video sequences and their
relationship with actions, utilities and context. It is these relationships which
define, or give meaning to, the discovered classes of displays. We demonstrate
the method on videos of humans playing a computer game, and show how the
model is conducive to intelligent decision making or to prediction.
Abstract. We address the problem of selecting a subset of the most relevant
features from a set of sample data in cases where there are multiple
(equally reasonable) solutions. In particular, this topic includes on one 
hand the introduction of hand-crafted kernels which emphasize certain 
desirable aspects of the data and, on the other hand, the suppression of 
one of the solutions given “side” data, i.e., when one is given informa- 
tion about undesired aspects of the data. Such situations often arise when 
there are several, even conflicting, dimensions to the data. For example, 
documents can be clustered based on topic, authorship or writing style; 
images of human faces can be clustered based on illumination conditions, 
facial expressions or by person identity, and so forth. 

Starting from a spectral method for feature selection, known as Q-α,
we introduce first a kernel version of the approach thereby adding the 
power of non-linearity to the underlying representations and the choice to 
emphasize certain kernel-dependent aspects of the data. As an alternative 
to the use of a kernel we introduce a principled manner for making use of 
auxiliary data within a spectral approach for handling situations where 
multiple subsets of relevant features exist in the data. The algorithm we 
will introduce allows for inhibition of relevant features of the auxiliary 
dataset and allows for creating a topological model of all relevant feature 
subsets in the dataset. 

To evaluate the effectiveness of our approach we have conducted experiments
both on real images of human faces under varying illumination,
facial expressions and person identity and on general machine learning
tasks taken from the UC Irvine repository. The performance of our algorithm
for selecting features with side information is generally superior
to current methods we tested (PCA, OPCA, CPCA and SDR-SI).



1 Introduction 

The problem of focusing on the most relevant measurements in a potentially 
overwhelming quantity of data is fundamental in machine vision and learning. 
Seeking out the relevant coordinates of a measurement vector is essential for 
making useful predictions as prediction accuracy drops significantly and training 
set size might grow exponentially with the growth of irrelevant features. To 
add complexity to what already is non-trivial, natural data sets may contain
multiple solutions, i.e., valid alternatives for relevant coordinate sets, depending 
on the task at hand. For example, documents can be analyzed based on topic, 
authorship or writing style; face images can be classified based on illumination 
conditions, facial expressions or by person identity; gene expression levels can
be clustered by pathologies or by correlations that also exist in other conditions. 

The main running example that we will use in this paper is that of selecting 
features from an unlabeled (unsupervised) dataset consisting of human frontal 
faces where the desired features are relevant for inter-person variability. The face 
images we will use vary along four dimensions: (i) people identity, (ii) facial
expressions, (iii) illumination conditions, and (iv) occlusions (see Fig. 1). One could
possibly select relevant features for each of these dimensions of relevance;
the challenge is how to perform the feature selection process on unlabeled data
given that there are multiple solutions (in this case four different ones).

There are two principal ways to handle this problem. The first is to embed
the feature selection algorithm into a higher dimensional space using a
hand-crafted kernel function (the so-called “kernel design” effort [11]). By selecting
the right kernel function it may be possible to emphasize certain aspects of the
data and de-emphasize others. The second approach is to introduce
the notion of side information, that is, to provide auxiliary data in the form of
an additional dataset which contains only the undesired dimensions of relevance.
The feature selection process would then proceed by selecting features that en- 
hance general dimensions of relevancy in the main dataset while inhibiting the 
dimensions of relevance in the auxiliary dataset. 



Fig. 1. 25 out of the 26 images in the AR dataset for three different persons.
Images vary not only in person identity but also in illumination, facial
expression, and amount and type of occlusion.

In this work we address both approaches. We start with the principle of 
spectral-based feature selection (introduced by [19]) and modify it to serve two 
new purposes: (i) endowing the approach with the power of kernel functions, 
satisfying the first approach for enriching the vector representation, and (ii) 
making use of auxiliary data for situations in which multiple subsets of relevant 
features exist in the data. The algorithm we will introduce allows for inhibition 
of relevant features of the auxiliary dataset and allows for creating a topological 
model of all relevant feature subsets in the dataset. The auxiliary dataset we 
consider could come in two different forms: the first being additional data points 
which represent undesired variability of the data, while the second form of side 
data consists of pairs of points which belong to different classes of variability,
i.e., are considered far away from each other in the space of selected coordinates. 

Side information (a.k.a. “irrelevant statistics” or “background information”)
appears in various contexts in the literature: clustering [20,1,14] and
continuous dimensionality reduction [15,5]. In this paper we address the use of side
information in the context of a hard selection of a feature subset. Feature
selection from unlabeled data differs from dimensionality reduction in that it only
selects a handful of features which are “relevant” with respect to some
inference task. Dimensionality reduction algorithms, for example PCA, generate a
small number of features each of which is a combination of all of the original 
features. In many situations of interest, in visual analysis in particular but also 
in other application domains such as Genomics for instance, it is assumed that 
each process being studied involves only a limited number of features taken from 
a pool of a very large set of measurements. For this reason feature combination 
methods are not as desirable as methods that extract a small subset of features. 
The challenge in the selection process is to overcome the computational burden
of pruning an exponential number of feature subsets. The Q-α algorithm [19],
which we propose using as a basis for our approach, handles the exponential
search space by harnessing the spectral information in such a manner that a
computationally straightforward optimization guarantees a sparse solution, i.e.,
a selection of features rather than a combination of the original features.

In the subsection below we will describe the Q-α algorithm which forms the
background for the work presented in this paper. In Section 2 we derive a kernel
method version of the Q-α algorithm which enables the representation of high
order cumulants among the entries of the feature vectors thereby considerably
strengthening the feature selection methodology. In Section 3 we introduce the
auxiliary data matrix as side data and derive the optimization for selecting
relevant features using the main dataset while inhibiting relevant features from 
the auxiliary dataset. In Section 4 we take the notion of auxiliary dataset a 
step further and form a complete topographical model of the relevant feature 
subsets. The general idea is based on rounds where the relevant features selected 
in previous rounds form “side” information for subsequent rounds. In this manner 
a hierarchical modeling of the feature subsets becomes feasible and can be used for
visualization and data modeling. In Section 5 we make use of another form of 
side information where the auxiliary data consists of pairs of points which belong 
to different classes of variability, i.e., are considered far away from each other in 
the space of selected coordinates. In Section 6 we evaluate the effectiveness of our 
algorithms by experiments on various datasets including real-image experiments 
on our main running example, and also running examples on general machine 
learning tasks taken from the UC Irvine repository. 

1.1 Selecting Relevant Features with the Q-α Algorithm

The Q-α algorithm for unsupervised feature selection is based on the
assumption that the selection of the relevant features (coordinates) will result in a
coherent set of clusters formed by the input data points restricted to the selected 
coordinates. The clustering score in this approach is measured indirectly. Rather
than explicitly performing a clustering phase for each feature selection candidate,
one employs spectral information in order to measure the cluster arrangement
coherency. Spectral algorithms have proven successful in clustering
[16], manifold learning or dimensionality reduction [12], and approximation methods
for NP-hard graph theoretical questions. In a nutshell, given a selection of
features, the strength (magnitude) of the leading k eigenvalues of the affinity matrix
constructed from the corresponding feature values across the sample data is
directly related to the coherence of the cluster arrangement induced by the subset
of selected features. The scheme is described as follows:

Let the data matrix be denoted by M. The feature values form the rows of M,
denoted by $m_1^\top, \dots, m_n^\top$, and normalized to unit norm $\|m_i\| = 1$. Each row vector
represents a feature (coordinate) sampled over the q trials. The column vectors
of M represent the q samples (each sample is a vector in $R^n$). For example,
a column can represent an image represented by its pixel values and a row 
can represent a specific pixel location whose value runs over the q images. As 
mentioned in the previous section, our goal is to select rows (features) from M 
such that the corresponding candidate data matrix (containing only the selected 
rows) consists of columns that are coherently clustered in k groups. The value 
of k is user dependent and is specific to the task at hand. The challenge in this 
approach is to avoid the exponential number of row selections and preferably 
avoid explicitly clustering the columns of the data matrix for each selection.

Mathematically, to obtain a clustering coherency score we compute the “affinity”
matrix of the candidate data matrix defined as follows. Let $\alpha_i \in \{0, 1\}$ be
the indicator value associated with the i’th feature, i.e., $\alpha_i = 1$ if the i’th
feature is selected and zero otherwise. Let $A_\alpha$ be the corresponding affinity matrix
whose (i, j) entries are the inner-product (correlation) between the i’th and j’th
columns of the resulting candidate data matrix: $A_\alpha = \sum_{i=1}^{n} \alpha_i m_i m_i^\top$ (sum of
rank-1 matrices). From algebraic graph theory, if the columns of the candidate
data matrix are coherently grouped into k clusters, we should expect the leading
k eigenvalues of $A_\alpha$ to be of high magnitude [8,10,2,16]. The resulting scheme
should therefore be to maximize the sum of eigenvalues of the candidate data
matrix over all possible settings of the indicator variables $\alpha_i$.
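As an illustration of the score described above, the following minimal sketch builds the affinity matrix for a given binary selection and sums its leading k eigenvalues. The toy data, the function names (`unit_rows`, `affinity`, `coherence_score`) and the cluster means are assumptions made for the example and do not come from the paper.

```python
import numpy as np

def unit_rows(M):
    """Scale every row (feature) of M to unit Euclidean norm."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def affinity(M, alpha):
    """A_alpha = sum_i alpha_i * m_i m_i^T, with m_i^T the i-th row of the n x q matrix M."""
    return M.T @ (alpha[:, None] * M)                  # q x q, symmetric

def coherence_score(M, alpha, k):
    """Sum of the k largest eigenvalues of A_alpha."""
    eigvals = np.linalg.eigvalsh(affinity(M, alpha))   # returned in ascending order
    return float(eigvals[-k:].sum())

# Hypothetical toy data: q = 60 samples that fall into k = 2 clusters.
rng = np.random.default_rng(0)
q, k = 60, 2
labels = np.repeat(np.arange(k), q // k)
centers = np.array([[1.0, -1.0], [2.0, -1.0], [-1.0, 3.0]])     # per-feature cluster means
relevant = centers[:, labels] + 0.05 * rng.normal(size=(3, q))  # follow the clusters
irrelevant = rng.normal(size=(3, q))                            # pure noise features
M = unit_rows(np.vstack([relevant, irrelevant]))

alpha_relevant = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
alpha_irrelevant = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
score_relevant = coherence_score(M, alpha_relevant, k)
score_irrelevant = coherence_score(M, alpha_irrelevant, k)
print(score_relevant, score_irrelevant)
```

Because the rows have unit norm, both selections have the same trace; the selection of cluster-following features concentrates that mass in the leading k eigenvalues, which is what the score rewards.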

What is done in practice, in order to avoid the exponential growth of assigning
binary values to n indicator variables, is to allow $\alpha_i$ to receive real values in
an unconstrained manner. A least-squares energy function over the variables
$\alpha_i$ is formed and its optimal value is sought. What makes this approach
different from the “garden variety” soft-decision-type algorithms is that this
particular setup of optimizing over spectral properties guarantees that the $\alpha_i$
always come out positive and sparse over all local maxima of the energy function.
This property is intrinsic rather than being the result of explicit constraints in the
form of regularizers, priors or inequality constraints. We optimize the following:



$$\max_{Q,\alpha_i} \; \mathrm{trace}(Q^\top A_\alpha^\top A_\alpha Q) \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i^2 = 1, \; Q^\top Q = I \qquad (1)$$
Note that the matrix Q holds the first k eigenvectors of $A_\alpha$ and that
$\mathrm{trace}(Q^\top A_\alpha^\top A_\alpha Q)$ is equal to the sum of squares of the leading k eigenvalues:
$\sum_{j=1}^{k} \lambda_j^2$. A local maximum of the energy function is achieved by interleaving
the “orthogonal iteration” scheme [6] within the computation of $\alpha$ as follows:

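Definition 1 (Q-α Method). Let M be an n x q input matrix with rows
$m_1^\top, \dots, m_n^\top$, and $Q^{(0)}$ some orthonormal q x k matrix, i.e., $Q^{(0)\top} Q^{(0)} = I$.
Perform the following steps through a cycle of iterations with index r = 1, 2, ...

1. Let $G^{(r)}$ be a matrix whose (i, j) components are
$(m_i^\top m_j)\, m_i^\top Q^{(r-1)} Q^{(r-1)\top} m_j$.

2. Let $\alpha^{(r)}$ be the leading eigenvector of $G^{(r)}$.

3. Let $A^{(r)} = \sum_{i=1}^{n} \alpha_i^{(r)} m_i m_i^\top$.

4. Let $Z^{(r)} = A^{(r)} Q^{(r-1)}$.

5. $Z^{(r)} \xrightarrow{QR} Q^{(r)} R^{(r)}$, that is, $Q^{(r)}$ is determined by the “QR” factorization of
$Z^{(r)}$.

6. Increment index r and go to step 1.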

Note that steps 4,5 of the algorithm consist of the “orthogonal iteration” module,
i.e., if we were to repeat steps 4,5 only we would converge onto the eigenvectors
of $A^{(r)}$. However, the algorithm does not repeat steps 4,5 in isolation and instead
recomputes the weight vector $\alpha$ (steps 1,2,3) before applying another cycle of
steps 4,5.
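The six steps of Definition 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `q_alpha`, the random initialization of $Q^{(0)}$, the fixed iteration count and the sign convention for the eigenvector are all choices made here for the example.

```python
import numpy as np

def q_alpha(M, k, n_iter=30, seed=0):
    """Sketch of the Q-alpha iteration. M is n x q with unit-norm rows (features).

    Returns the weight vector alpha (length n) and the orthonormal q x k matrix Q.
    """
    n, q = M.shape
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(q, k)))   # Q^(0): random orthonormal q x k (assumption)
    gram = M @ M.T                                 # entries m_i^T m_j
    alpha = np.zeros(n)
    for _ in range(n_iter):                        # fixed number of cycles (assumption)
        P = M @ Q                                  # row i holds m_i^T Q
        G = gram * (P @ P.T)                       # step 1: elementwise product
        _, U = np.linalg.eigh(G)                   # G is symmetric
        alpha = U[:, -1]                           # step 2: eigenvector of the largest eigenvalue
        if alpha.sum() < 0:                        # eigenvectors are defined up to sign
            alpha = -alpha
        A = M.T @ (alpha[:, None] * M)             # step 3: A = sum_i alpha_i m_i m_i^T
        Z = A @ Q                                  # step 4
        Q, _ = np.linalg.qr(Z)                     # step 5: keep the orthonormal factor
    return alpha, Q

# Hypothetical toy data: 3 features that follow two clusters, 3 noise features.
rng = np.random.default_rng(0)
q, k = 60, 2
labels = np.repeat(np.arange(k), q // k)
centers = np.array([[1.0, -1.0], [2.0, -1.0], [-1.0, 3.0]])
M = np.vstack([centers[:, labels] + 0.05 * rng.normal(size=(3, q)),
               rng.normal(size=(3, q))])
M = M / np.linalg.norm(M, axis=1, keepdims=True)

alpha, Q = q_alpha(M, k)
print(np.round(alpha, 3))
```

The sketch does not enforce positivity or sparsity of the weights; according to the text these properties are expected to emerge from the structure of G rather than from explicit constraints.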

The algorithm would be meaningful provided that three conditions are met:

1. the algorithm converges to a local maximum,

2. at the local maximum $\alpha_i \geq 0$ (because negative weights are not admissible),
and

3. the weight vector $\alpha$ is sparse (because without it the soft decision does not
easily translate into a hard feature selection).

Conditions (2) and (3) are not readily apparent in the formulation of the
algorithm (the energy function lacks the explicit inequality constraint $\alpha_i \geq 0$ and
an explicit term to “encourage” sparse solutions) but are nevertheless satisfied.
The key for having sparse and non-negative (same sign) weights is buried in the
matrix G (step 1). Generally, the entries of G are not necessarily positive
(otherwise $\alpha$ would have been non-negative due to the Perron-Frobenius theorem);
nevertheless, due to its makeup it can be shown that in a probabilistic manner the
leading eigenvector of G is positive with probability $1 - o(1)$. In other words, as
the number of features n grows larger the chances that the leading eigenvector of
G is positive increase rapidly to unity. The details of why the makeup of G
induces such a property, the convergence proof and the proof of the “Probabilistic
Perron-Frobenius” claim can be found in [19].

Finally, it is worth noting that the scheme can be extended to handle the
supervised situation (when class labels are provided); that the scheme can be
applied also to the Laplacian affinity matrix; and that the scheme readily applies
when the spectral gap $\sum_{i=1}^{k} \lambda_i^2 - \sum_{j=k+1}^{q} \lambda_j^2$ is maximized rather than $\sum_{i=1}^{k} \lambda_i^2$
alone. Details can be found in [19].
2 Representing Higher-Order Cumulants Using Kernel Methods

The information on which the Q-α method relies to select features is
contained in the matrix G. Recall that the criterion function underlying the Q-α
algorithm is a sum over all pairwise feature vector relations:

$$\mathrm{trace}(Q^\top A_\alpha^\top A_\alpha Q) = \alpha^\top G \alpha,$$

where G is defined such that $G_{ij} = (m_i^\top m_j)\, m_i^\top Q Q^\top m_j$. It is apparent that
feature vectors interact in pairs and the interaction is bilinear. Consequently,
cumulants of the original data matrix M which are of higher order than two are
not being considered by the feature selection scheme. For example, if M were
to be decorrelated (i.e., $MM^\top$ is diagonal) the matrix G would be diagonal and
the feature selection scheme would select only a single feature.
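Both statements in the paragraph above can be checked numerically. The sketch below, on random data chosen only for illustration, verifies the identity between the trace criterion and the quadratic form in G for an arbitrary orthonormal Q, and confirms that G becomes diagonal when the rows of M are decorrelated.

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, k = 8, 20, 3

# Random unit-norm features, arbitrary weights and an arbitrary orthonormal Q.
M = rng.normal(size=(n, q))
M /= np.linalg.norm(M, axis=1, keepdims=True)
alpha = rng.normal(size=n)
Q, _ = np.linalg.qr(rng.normal(size=(q, k)))

# Left-hand side: trace(Q^T A^T A Q) with A = sum_i alpha_i m_i m_i^T.
A = M.T @ (alpha[:, None] * M)
lhs = np.trace(Q.T @ A.T @ A @ Q)

# Right-hand side: alpha^T G alpha with G_ij = (m_i^T m_j) m_i^T Q Q^T m_j.
P = M @ Q
G = (M @ M.T) * (P @ P.T)
rhs = alpha @ G @ alpha

# Decorrelated case: rows of M_dec are orthonormal, so M_dec M_dec^T = I.
Q_rows, _ = np.linalg.qr(rng.normal(size=(q, n)))
M_dec = Q_rows.T
P_dec = M_dec @ Q
G_dec = (M_dec @ M_dec.T) * (P_dec @ P_dec.T)

print(lhs, rhs)
```

With a diagonal G the leading eigenvector is a standard basis vector, which is why only a single feature would be selected in the decorrelated case.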

In this section we employ the “kernel trick” to include cumulants of higher
orders among the feature vectors in the feature selection process. This serves 
two purposes: On one hand the representation is enriched with non-linearities 
induced by the kernel, and on the other hand, given a successful choice of a 
kernel (so called Kernel Design effort [11]) one could possibly emphasize certain 
desirable aspects of the data while inhibiting others. 

Kernel methods in general have been attracting much attention in the
machine learning literature, initially with support vector machines [13]
and later taking on a life of their own (see [11]). Mathematically, the kernel
approach is defined as follows: let $x_1, \dots, x_l$ be vectors in the input space, say $R^q$,
and consider a mapping $\phi(x) : R^q \to \mathcal{F}$ where $\mathcal{F}$ is an inner-product space.
The kernel trick is to calculate the inner-product in $\mathcal{F}$ using a kernel function
$k : R^q \times R^q \to R$, $k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$, while avoiding explicit mappings
(evaluation of) $\phi()$. Common choices of kernel selection include the d’th order
polynomial kernels $k(x_i, x_j) = (x_i^\top x_j + c)^d$ and the Gaussian RBF kernels
$k(x_i, x_j) = \exp(-\frac{1}{2\sigma^2} \|x_i - x_j\|^2)$. If an algorithm can be restated such that the
input vectors appear in terms of inner-products only, one can substitute the
inner-products by such a kernel function. The resulting kernel algorithm can
be interpreted as running the original algorithm on the space $\mathcal{F}$ of mapped
objects $\phi(x)$. Kernel methods have been applied to the support vector machine
(SVM), principal component analysis (PCA), ridge regression, canonical
correlation analysis (CCA), QR factorization and the list goes on. We will focus below
on deriving a kernel method for the Q-α algorithm.
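To make the kernel trick concrete, the following sketch evaluates a homogeneous quadratic kernel (c = 0, d = 2) and compares it with the inner product of an explicit feature map consisting of all pairwise products of coordinates. The function names are hypothetical, and the $2\sigma^2$ normalization of the RBF kernel is the common convention, assumed here for the example.

```python
import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel (x^T y + c)^d."""
    return (x @ y + c) ** d

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel with the common 1 / (2 sigma^2) scaling (assumed convention)."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def phi_quadratic(x):
    """Explicit feature map for the kernel (x^T y)^2: all products x_a * x_b."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(2)
x, y = rng.normal(size=5), rng.normal(size=5)

explicit = phi_quadratic(x) @ phi_quadratic(y)   # inner product in the 25-dimensional mapped space
implicit = poly_kernel(x, y, c=0.0, d=2)         # same value, computed in the 5-dimensional input space
print(explicit, implicit)
```

The explicit map grows quadratically with the input dimension, while the kernel evaluation costs a single inner product; for the RBF kernel the mapped space is infinite-dimensional and only the implicit route is available.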



2.1 Kernel Q-α

We will consider mapping the rows $m_i^\top$ of the data matrix M such that the
rows of the mapped data matrix become $\phi(m_1)^\top, \dots, \phi(m_n)^\top$. Since the entries
of G consist of inner-products between pairs of mapped feature vectors, the
interaction will no longer be bilinear and will contain higher-order cumulants
whose nature depends on the choice of the kernel function.
Replacing the rows of M with their mapped version introduces some
challenges before we can apply the kernel trick. The affinity matrix
$A_\alpha = \sum_i \alpha_i \phi(m_i) \phi(m_i)^\top$ cannot be explicitly evaluated because $A_\alpha$ is defined by
outer-products rather than inner-products of the mapped feature vectors $\phi(m_i)$. The
matrix Q holding the eigenvectors of $A_\alpha$ cannot be explicitly evaluated as well
and likewise the matrix $Z = A_\alpha Q$ (in step 4). As a result, kernelizing the Q-α
algorithm requires one to represent $\alpha$ without explicitly representing $A_\alpha$ and
Q, both of which were instrumental in the original algorithm. Moreover, the
introduction of the kernel should be done in such a manner as to preserve the key
property of the original Q-α algorithm of producing a sparse solution.

Let $V = MM^\top$ be the n x n matrix whose entries are evaluated using the
kernel $v_{ij} = k(m_i, m_j)$. Let $Q = M^\top E$ for some n x k (recall k being the
number of clusters in the data) matrix E. Let $D_\alpha = \mathrm{diag}(\alpha_1, \dots, \alpha_n)$ and thus
$A_\alpha = M^\top D_\alpha M$ and $Z = A_\alpha Q = M^\top D_\alpha V E$. The matrix Z cannot be explicitly
evaluated but $Z^\top Z = E^\top V D_\alpha V D_\alpha V E$ can be evaluated. The matrix G can be
expressed with regard to E instead of Q:

$$G_{ij} = (\phi(m_i)^\top \phi(m_j))\, \phi(m_i)^\top Q Q^\top \phi(m_j)$$
$$= k(m_i, m_j)\, \phi(m_i)^\top (M^\top E)(M^\top E)^\top \phi(m_j)$$
$$= k(m_i, m_j)\, v_i^\top E E^\top v_j$$

where $v_1, \dots, v_n$ are the columns of V. Step 5 of the Q-α algorithm consists of
a QR factorization of Z. Although Z is uncomputable it is possible to compute
R and $R^{-1}$ directly from the entries of $Z^\top Z$ without computing Q using the
Kernel Gram-Schmidt described in [18]. Since $Q = ZR^{-1} = M^\top D_\alpha V E R^{-1}$ the
update step is simply to replace E with $ER^{-1}$ and start the cycle again. In other
words, rather than updating Q we update E and from E we obtain G and from
there the newly updated $\alpha$. The kernel Q-α is summarized below:

Definition 2 (Kernel Q-α). Let M be an uncomputable matrix with
rows $\phi(m_1)^\top, \dots, \phi(m_n)^\top$. The kernel function is given by $\phi(m_i)^\top \phi(m_j) =
k(m_i, m_j)$. The matrix $V = MM^\top$ is a computable n x n matrix. Let $E^{(0)}$
be an n x k matrix selected such that $M^\top E^{(0)}$ has orthonormal columns. Iterate
over the steps below, with the index r = 1, 2, ...

1. Let $G^{(r)}$ be an n x n matrix whose (i, j) components are
$k(m_i, m_j)\, v_i^\top E^{(r-1)} E^{(r-1)\top} v_j$.

2. Let $\alpha^{(r)}$ be the largest eigenvector of $G^{(r)}$, and let $D^{(r)} = \mathrm{diag}(\alpha_1^{(r)}, \dots, \alpha_n^{(r)})$.

3. Let $Z^{(r)}$ be an uncomputable matrix
$Z^{(r)} = (M^\top D^{(r)} M)(M^\top E^{(r-1)}) = M^\top D^{(r)} V E^{(r-1)}$.

4. $Z^{(r)} \xrightarrow{QR} QR$. It is possible to compute directly R, $R^{-1}$ from the entries of
the computable matrix $Z^{(r)\top} Z^{(r)}$ without explicitly computing the matrix Q
(see [18]).

5. Let $E^{(r)} = E^{(r-1)} R^{-1}$.

6. Increment index r and go to step 1.
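The quantities that the kernel version evaluates without touching the mapped data can be checked against their explicit counterparts when the kernel is linear, since then $\phi$ is the identity and M is available. The sketch below is such a sanity check under stated assumptions: the initial E is built by whitening a random matrix, and R is recovered from $Z^\top Z$ by a Cholesky factorization, which stands in for the Kernel Gram-Schmidt procedure of [18] and is not claimed to be that procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
n, q, k = 6, 10, 3
M = rng.normal(size=(n, q))            # linear kernel: phi is the identity, so M is explicit
V = M @ M.T                            # v_ij = k(m_i, m_j) = m_i^T m_j

# Choose E so that Q = M^T E has orthonormal columns, i.e. E^T V E = I (assumed construction).
E0 = rng.normal(size=(n, k))
L = np.linalg.cholesky(E0.T @ V @ E0)  # E0^T V E0 = L L^T
E = E0 @ np.linalg.inv(L).T

# Kernel-side quantities: only V and E are used.
VE = V @ E
G_kernel = V * (VE @ VE.T)             # G_ij = k(m_i, m_j) v_i^T E E^T v_j
_, U = np.linalg.eigh(G_kernel)
alpha = U[:, -1]                       # eigenvector of the largest eigenvalue
D = np.diag(alpha)
ZtZ_kernel = E.T @ V @ D @ V @ D @ V @ E
R_kernel = np.linalg.cholesky(ZtZ_kernel).T   # upper triangular, R^T R = Z^T Z

# Explicit quantities: computable here only because the kernel is linear.
Q = M.T @ E
P = M @ Q
G_explicit = (M @ M.T) * (P @ P.T)
Z = M.T @ D @ V @ E
_, R_explicit = np.linalg.qr(Z)

print(np.abs(R_kernel - np.abs(R_explicit) * np.sign(R_kernel)).max())
```

The two triangular factors agree up to the sign of each row, which is the usual ambiguity of a QR factorization; the Cholesky route fixes the diagonal of R to be positive.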
The result of the algorithm is the weight vector $\alpha$ and the design matrix
G which contains all the data about the features. The drawback of the kernel
approach for handling multiple structures of the data is that the successful choice
of a kernel depends on the user and is largely an open problem. For example, with
regard to our main running example it is unclear which kernel to choose that
will strengthen the clusters induced by inter-personal variation and inhibit the
clusters induced by lighting and facial expressions. We therefore move our attention
to the alternative approach using the notion of side data.

3 Q-α with Side Information

Consider the n x q data matrix M defined above as the “main” data. We are
given an auxiliary n x p data matrix W with rows $w_1^\top, \dots, w_n^\top$ representing p
data points comprising the “side” information. Our goal is to select a subset
of coordinates, namely, determine the weight vector $\alpha$ such that the affinity matrix
$\sum_i \alpha_i m_i m_i^\top$ has k coherent clusters (measured by the sum of squares of the
first k eigenvalues) whereas $\sum_i \alpha_i w_i w_i^\top$ has low cluster coherence. The desire
for low cluster coherence for the side information can be represented by small
variance of each coordinate value along the p samples. Namely, if $m_i$ is selected
as a relevant feature of the main data, we should expect that the corresponding
side feature vector $w_i$ will have a small variance. Small variance of the selected
rows of W means that the corresponding affinity matrix $\sum_i \alpha_i w_i w_i^\top$ represents
a single cluster (whether coherent or not is immaterial).

To clarify the logic behind our approach, consider the scenario presented in
[5]. Assume we are given face images of 5 individuals covering variability of
illumination and facial expressions, a total of 26 images per individual. The main
data matrix M will contain therefore 130 columns. We wish to select relevant 
features (rows of M), however, there are three dimensions of relevancy: (i) per- 
son identity, (ii) illumination direction, and (iii) facial expressions. One could 
possibly select relevant features for each dimension of relevance and obtain a co- 
herent clustering in that dimension. Say we are interested in the person identity 
dimension of relevance. In that case the auxiliary matrix W will contain 26 im- 
ages of a 6th individual (covering facial expressions and illumination conditions). 
Features selected along the dimensions of facial expression or illumination will 
induce coherent clusters in the side data, whereas features selected along the
person identity dimension will induce a single cluster (or no structure at all) in
the side data, and low variance of the feature vectors is indicative of a single
cluster or no structure at all. In formal notation we have the following:

Let $D = \mathrm{diag}(\mathrm{var}(w_1^\top), \dots, \mathrm{var}(w_n^\top))$ be a diagonal matrix with the variance
of the rows of W. The low coherence desire over the side data translates to
minimization of $\alpha^\top D \alpha$. Taken together, we have a Rayleigh quotient type of
energy function to maximize:

$$\max_{Q,\alpha_i} \; \frac{\mathrm{trace}(Q^\top A_\alpha^\top A_\alpha Q)}{\alpha^\top (D + \lambda I) \alpha} = \frac{\alpha^\top G \alpha}{\alpha^\top (D + \lambda I) \alpha} \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i^2 = 1, \; Q^\top Q = I \qquad (2)$$

where G is the matrix defined above whose entries are: $G_{ij} = (m_i^\top m_j)\, m_i^\top
Q Q^\top m_j$. The scalar $\lambda \geq 0$ is user-settable with the purpose of providing the
tradeoff between the main data and the side data. Large values of $\lambda$ translate to
low weight for the side information in the feature selection scheme. A vanishing
value $\lambda = 0$ is admissible provided that none of the variances vanishes (D has
no vanishing entries along its diagonal); in that case equal weight is given to
the two sources of data. The Q-α with side information algorithm becomes:

Definition 3 (Q-α with Side Information). Let M be an n x q input matrix
with rows $m_1^\top, \dots, m_n^\top$, W be an n x p “side” matrix where the variance of its
rows form a diagonal matrix D, and $Q^{(0)}$ is some orthonormal q x k matrix,
i.e., $Q^{(0)\top} Q^{(0)} = I$. Perform the following steps through a cycle of iterations
with index r = 1, 2, ...

1. Let $G^{(r)}$ be a matrix whose (i, j) components are
$(m_i^\top m_j)\, m_i^\top Q^{(r-1)} Q^{(r-1)\top} m_j$.

2. Let $\alpha^{(r)}$ be the largest eigenvector of $(D + \lambda I)^{-1} G^{(r)}$.

3. Let $A^{(r)} = \sum_{i=1}^{n} \alpha_i^{(r)} m_i m_i^\top$.

4. Let $Z^{(r)} = A^{(r)} Q^{(r-1)}$.

5. $Z^{(r)} \xrightarrow{QR} Q^{(r)} R^{(r)}$.

6. Increment index r and go to step 1.

Note the change in step 2 compared to the Q-α algorithm. Since $D + \lambda I$ is a
diagonal positive matrix, its inverse is also positive therefore the positivity of G
is not affected. In other words, the properties of G which induce a positive (and
sparse) solution for the weight vector $\alpha$ (see [19]) are not negatively affected
when G is multiplied with a positive diagonal matrix. If D were not diagonal,
then $D^{-1}$ would not have been positive and the optimized $\alpha$ values would not
come out positive and sparse.
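A minimal sketch of Definition 3 follows. Since $(D + \lambda I)^{-1} G$ is not symmetric, the sketch obtains its leading eigenvector from the similar symmetric matrix $C^{1/2} G C^{1/2}$ with $C = (D + \lambda I)^{-1}$; this numerical route, the function names, the default $\lambda$, the iteration count and the toy data are assumptions made for the example.

```python
import numpy as np

def leading_scaled_eigvec(G, scale):
    """Leading eigenvector of diag(scale) @ G for symmetric G and positive scale.

    If S = C^(1/2) G C^(1/2) and S u = mu u, then (C G)(C^(1/2) u) = mu (C^(1/2) u).
    """
    s = np.sqrt(scale)
    S = s[:, None] * G * s[None, :]
    _, U = np.linalg.eigh(S)
    alpha = s * U[:, -1]
    alpha /= np.linalg.norm(alpha)
    return alpha if alpha.sum() >= 0 else -alpha

def q_alpha_si(M, W, k, lam=0.1, n_iter=30, seed=0):
    """Sketch of Q-alpha with side information; M is n x q, W is n x p."""
    n, q = M.shape
    scale = 1.0 / (W.var(axis=1) + lam)            # diagonal of (D + lam I)^(-1)
    Q, _ = np.linalg.qr(np.random.default_rng(seed).normal(size=(q, k)))
    gram = M @ M.T
    alpha = np.zeros(n)
    for _ in range(n_iter):
        P = M @ Q
        G = gram * (P @ P.T)                       # step 1
        alpha = leading_scaled_eigvec(G, scale)    # step 2 (the only changed step)
        A = M.T @ (alpha[:, None] * M)             # step 3
        Q, _ = np.linalg.qr(A @ Q)                 # steps 4 and 5
    return alpha, Q

# Hypothetical data: 6 features, 60 main samples, 20 side samples.
rng = np.random.default_rng(4)
M = rng.normal(size=(6, 60))
M /= np.linalg.norm(M, axis=1, keepdims=True)
W = rng.normal(size=(6, 20)) * np.array([0.1, 0.1, 0.1, 1.0, 1.0, 1.0])[:, None]
alpha, Q = q_alpha_si(M, W, k=2)

# Check of the eigenvector helper on a random symmetric positive semi-definite matrix.
B = rng.normal(size=(6, 6))
G_test = B @ B.T
scale_test = rng.uniform(0.5, 2.0, size=6)
v = leading_scaled_eigvec(G_test, scale_test)
T = scale_test[:, None] * G_test
mu = v @ T @ v
print(np.round(alpha, 3))
```

Features whose side rows have large variance receive a small entry in `scale`, which lowers their influence on the leading eigenvector; this is the inhibition mechanism described in the text.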

4 Topographical Model of All Relevant Feature Subsets 

We can further extend the notion of “negative variability” embedded in the side
information to a wider perspective of representing a hierarchy of feature subsets
extracted iteratively. The general idea is to treat the weight vector $\alpha$ (which
determines the feature selection as it is a sparse positive vector) as representing
axes of negative variability for subsequent rounds. Let $\alpha$ be the feature selection
solution given by running the Q-α algorithm. We wish to run Q-α again while
looking for an alternative solution along a different dimension of variability. We
construct a “side information” matrix D whose diagonal is $D = \mathrm{diag}(\alpha_1^2, \dots, \alpha_n^2)$
and run the Q-α-with-SI algorithm. The new weight vector $\alpha'$ will be
encouraged to have high values in coordinates where $\alpha$ has low values. This is applied
iteratively where in each round Q-α-with-SI is executed with the matrix D
containing the sum of squared $\alpha$ values summed over all previous rounds.

Furthermore, the G matrix resulting from each round of the above scheme can
be used for generating a coordinization of the features as a function of the implicit
clustering of the (projected) data. The weight vector $\alpha$ is the largest eigenvector
of G, but as in Multi-Dimensional Scaling (MDS), the first few largest eigenvectors of
G form a coordinate frame. Assume we wish to represent the selected features by
a 1D coordinate. This can be achieved by taking the first two largest eigenvectors
of G, so that each feature is represented by two coordinates. A 1D representation
is made by normalizing the coordinate pair (i.e., each feature is represented by
a direction in the 2D MDS frame). Given r rounds, each feature is represented
by r coordinates which can be used for visualization and data modeling.
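The round-based scheme and the per-round coordinate can be sketched as follows. Several choices here are assumptions rather than details given in the text: a strictly positive $\lambda$ is used so that the first round, where D is zero, reduces to plain Q-α; the normalized coordinate pair is encoded as an angle; and the names `q_alpha_diag`, `direction_coordinate` and `feature_map` are hypothetical.

```python
import numpy as np

def q_alpha_diag(M, d, k, lam, n_iter=30, seed=0):
    """Q-alpha with a diagonal side matrix D = diag(d); returns (alpha, G)."""
    n, q = M.shape
    s = 1.0 / np.sqrt(d + lam)                     # (D + lam I)^(-1/2), requires d + lam > 0
    Q, _ = np.linalg.qr(np.random.default_rng(seed).normal(size=(q, k)))
    gram = M @ M.T
    for _ in range(n_iter):
        P = M @ Q
        G = gram * (P @ P.T)
        _, U = np.linalg.eigh(s[:, None] * G * s[None, :])
        alpha = s * U[:, -1]                       # leading eigenvector of (D + lam I)^(-1) G
        alpha /= np.linalg.norm(alpha)
        if alpha.sum() < 0:
            alpha = -alpha
        Q, _ = np.linalg.qr(M.T @ (alpha[:, None] * M) @ Q)
    return alpha, G

def direction_coordinate(G):
    """Angle of each feature in the frame of the two leading eigenvectors of G.

    Eigenvectors are defined up to sign, so the angles are defined up to reflection.
    """
    _, U = np.linalg.eigh(G)
    return np.arctan2(U[:, -2], U[:, -1])

def feature_map(M, k, rounds=2, lam=0.1):
    """Run several rounds; earlier solutions act as side information for later ones."""
    d = np.zeros(M.shape[0])
    alphas, coords = [], []
    for _ in range(rounds):
        alpha, G = q_alpha_diag(M, d, k, lam)
        alphas.append(alpha)
        coords.append(direction_coordinate(G))
        d = d + alpha ** 2                         # accumulate squared weights of all previous rounds
    return np.array(alphas), np.array(coords)

# Hypothetical data: 8 unit-norm features over 40 samples.
rng = np.random.default_rng(5)
M = rng.normal(size=(8, 40))
M /= np.linalg.norm(M, axis=1, keepdims=True)
alphas, coords = feature_map(M, k=2, rounds=2)
print(np.round(alphas, 3))
```

Each column of `coords` gives one feature's position in the map (one coordinate per round), and the matching column of `alphas` gives its relevancy per round.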

An example of such a topographical map is shown in Fig. 2. The data
matrix consists of 150 data points each described by 20 features out of which 
9 are relevant. The relevant features form two possible solution sets where each 
solution induces three clusters of data points. The first set consists of three 
features marked by “1,2,3”, while the second set consists of three different fea- 
tures marked by “A,B,C”. Three additional features marked by “1A,2B,3C” were 
constructed by summing the corresponding feature vectors 1,2,3 and A,B,C, re- 
spectively. The remaining 11 (irrelevant) features were constructed by randomly 
permuting the entries of the first feature vector. We ran Q-α twice, creating for
each feature two coordinates (one per each run) as described above. In addition 
to the coordinization of each feature we have associated the corresponding a 
value as a measure of “relevancy” of the feature per solution. Taken together, 
each feature is represented by a position in the 2D topographical map and a 2D 
magnitude represented by an ellipse whose major axes capture the respective 
a values. The horizontal axis in Fig. 2(b) is associated with the solution set of 
features “1,2,3” and the vertical axis with the solution set “A,B,C”. We see that 
the hybrid features 1A,2B,3C, which are relevant to both cluster configurations, 
have equal (high) relevancy in both sets (large circles in the topographical map). 




Fig. 2. (a) A synthetic dataset used to demonstrate the creation of a topographical 
model of the features (b) The resulting topographical model (see text). 
5 Pairwise Side Information

Another possible variant of Q-α-SI is when the side information is given over
pairs of “negative” data points. Consider the (adapted) problem raised by [20] in
the context of distance metric learning for clustering: we are given a set of data
points forming a data matrix M (the “main” data). As side information we are
given pairs of points $x_i, x_j$ which are known to be part of different clusters. We
wish to select features (coordinates) such that the main data contains maximally
coherent clusters while obeying the side information (i.e., features are selected
such that for each of the “side” pairs $(x_i^\top x_j)^2$ is small).

We can incorporate the side information by constructing a side matrix B
which functions similarly to the diagonal matrix D we constructed in the
previous sections. The difference here would be that B is not diagonal and therefore
needs to be handled differently. Consider a pair of side points x, y. We wish to
find the weight vector $\alpha$ such that: $(x^\top y)^2 = (\sum_{i=1}^{n} \alpha_i x_i y_i)^2 = \alpha^\top F \alpha$ is small,
where $F_{rs} = x_r y_r x_s y_s$. Denote by $F^{ij}$ the matrix corresponding to the pair of
side points $x_i, x_j$ and let $B = \sum_i \sum_j F^{ij}$.

Our goal is to maximize the spectral information coming from the main data
(as before) while minimizing $\alpha^\top B \alpha$. We are back to the same framework as in
Sections 3 and 4 with the difference that B is not diagonal therefore the product
$B^{-1} G$ is not guaranteed to obey the properties necessary for the weight vector $\alpha$
to come out positive and sparse. Instead, we define an additive energy function:

$$\max_{Q,\alpha_i} \; \mathrm{trace}(Q^\top A_\alpha^\top A_\alpha Q) - \lambda\, \alpha^\top B \alpha \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i^2 = 1, \; Q^\top Q = I \qquad (3)$$

This energy function is equivalent to $\alpha^\top (G - \lambda B) \alpha$ where $\lambda$ trades off the weight
given to the side data. The algorithm follows the steps of the Q-α algorithm
with the difference in step 2: “$\alpha^{(r)}$ is the largest eigenvector of $G^{(r)} - \lambda B$.”
There is an open issue of showing that $\alpha$ comes out positive and sparse. The
matrix G is “dominantly positive”, i.e., when treated as a random matrix each
entry has a positive mean and thus it can be shown that the probability of a
positive $\alpha$ asymptotes at unity very fast with n [19]. The question is what happens
to the mean when one subtracts $\lambda B$ from G. Our working assumption is that the
entries of B are significantly smaller than the corresponding entries of G because
the inner-products of the side points should be small; otherwise they would not
have been supplied as side points. Empirical studies on this algorithm validate
this assumption and indeed $\alpha$ maintains the positivity and sparsity properties
in our experiments.
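The construction of the pairwise side matrix and the modified step 2 can be sketched as below. The data, the list of side pairs, the value of $\lambda$ and the function names are hypothetical; the sketch only checks the quadratic-form identity for F and shows how the weight vector would be read off $G - \lambda B$ for one fixed Q.

```python
import numpy as np

def pair_matrix(x, y):
    """F with entries F_rs = x_r y_r x_s y_s, so that alpha^T F alpha = (sum_i alpha_i x_i y_i)^2."""
    z = x * y
    return np.outer(z, z)

def side_matrix(X, pairs):
    """B = sum of F^{ij} over the supplied side pairs; columns of X are data points."""
    n = X.shape[0]
    B = np.zeros((n, n))
    for i, j in pairs:
        B += pair_matrix(X[:, i], X[:, j])
    return B

def alpha_step(G, B, lam):
    """Modified step 2: eigenvector of the largest (algebraic) eigenvalue of G - lam * B."""
    _, U = np.linalg.eigh(G - lam * B)
    alpha = U[:, -1]
    return alpha if alpha.sum() >= 0 else -alpha

# Hypothetical data: n = 6 features, q = 12 points, two side pairs.
rng = np.random.default_rng(6)
n, q, k = 6, 12, 2
X = rng.normal(size=(n, q))
pairs = [(0, 5), (1, 7)]
B = side_matrix(X, pairs)

# Identity check for a single pair and arbitrary weights.
w = rng.normal(size=n)
x, y = X[:, 0], X[:, 5]
quad_form = w @ pair_matrix(x, y) @ w
direct = np.sum(w * x * y) ** 2

# One evaluation of the modified step 2 for a fixed orthonormal Q.
M = X / np.linalg.norm(X, axis=1, keepdims=True)
Q, _ = np.linalg.qr(rng.normal(size=(q, k)))
P = M @ Q
G = (M @ M.T) * (P @ P.T)
alpha = alpha_step(G, B, lam=0.01)
print(quad_form, direct)
```

Unlike the diagonal case, nothing in this step keeps the weights non-negative; as the text notes, that behaviour rests on the entries of B being small relative to those of G.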

6 Experiments 

We present below three types of experiments: (i) simulations on synthetic data
for the purpose of studying the effects of different weightings of the side data,
(ii) our main running example on the AR face dataset, and (iii) various examples
taken from the UC Irvine repository of data sets. Due to space constraints the
synthetic simulations are given only in the technical report [17].

Face images. Our main running example is the selection of features from 
an unlabeled data set of face images taken from the AR dataset [7]. The dataset 
consists of 100 people with 26 images per person varying according to lighting 
direction and facial expressions. Our task is to select those features which are 
relevant for distinguishing between people identities only. The dataset contains 
three dimensions of relevancy, and the use of side data is crucial for inhibiting 
the unwanted dimensions of facial expressions and lighting variations. Following 
[5] we adopted the setting where the main data set contained the images of 5 
randomly chosen men (out of the 50 men) totaling 130 images. The side dataset 
consisted of the 26 images of a random sixth man. The feature selection process 
Q-α-SI looks for coordinates which maximize the cluster coherence of the
main dataset while minimizing the variance of the coordinate vectors of the side 
data. As a result, the selected coordinates are relevant for separating among 
person identities while being invariant to the other dimensions of variability. 
The task of clustering those images into the five correct clusters is hard since the 
nuisance structures (such as those generated by variation of lighting and facial 
expressions) are far more dominant than the structure of person variability. 

The feature values we use as a representation of the image are designed to
capture the relationship between average intensities of neighboring regions. This
suggests the use of a family of basis functions, like the Haar wavelets, which
encode such relationships along different orientations (see [9,4]). In our
implementation the Haar wavelet transform is run over an image and results in a set of
5227 coefficients at several scales that indicate the response of the wavelets over
the entire image. Many of the coefficients are irrelevant for the task of separating
between facial identities, and it is therefore the goal of Q-α-SI to find
those coefficients that represent the relevant regions.
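As a rough illustration of the kind of feature described above, the following sketch computes a plain orthonormal multi-level 2D Haar decomposition with NumPy. It is only a stand-in: the transform that yields the 5227 coefficients is not specified in this section, and the function names `haar_level` and `haar_features` as well as the choice of three levels are assumptions made for the example.

```python
import numpy as np

def haar_level(img):
    """One level of an orthonormal 2D Haar transform (even height and width assumed)."""
    a = img[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    approx = (a + b + c + d) / 2.0  # scaled local average
    horiz = (a - b + c - d) / 2.0   # left half versus right half
    vert = (a + b - c - d) / 2.0    # top half versus bottom half
    diag = (a - b - c + d) / 2.0    # diagonal contrast
    return approx, horiz, vert, diag

def haar_features(img, levels=3):
    """Stack the detail coefficients of every level plus the final approximation."""
    approx = np.asarray(img, dtype=float)
    coeffs = []
    for _ in range(levels):
        approx, h, v, d = haar_level(approx)
        coeffs.extend([h.ravel(), v.ravel(), d.ravel()])
    coeffs.append(approx.ravel())
    return np.concatenate(coeffs)
```

Each detail coefficient compares the average intensities of neighboring regions along one orientation, which is the relationship the feature representation is meant to capture; a feature selection method then only has to weight the entries of the returned vector.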

To quantify the performance of our algorithm in a comparative manner we
used the normalized precision score introduced in [5,15], which measures the
average purity of the k-Nearest Neighbors for varying values of k. We compared the
performance to four methods: PCA, which is the most popular technique for
dimensionality reduction, Constrained PCA (CPCA) and Oriented PCA (OPCA)
[3], and Sufficient Dimensionality Reduction with Side Information (SDR-SI) [5].
All but the first method (PCA) utilize the same side data as Q-α-SI.
It is also worth noting that all the methods we compared against extract features as
combinations of the original features rather than just selecting features.

Optimal parameters (dimensionality and λ) for all methods were chosen to
maximize the precision index for a training set. The wavelet decomposition was 
not optimal for the other methods and therefore the raw image intensities were 
used instead. Reported results were obtained on a separate test set. The entire 
procedure was repeated 20 times on randomly chosen subsets of the AR database. 

Fig. 3. (a) Comparison of the normalized precision index between CPCA, PCA, OPCA,
SDR-SI, and Q-α-SI on the AR dataset. (b) Sorted feature weights (α values) for
each of the 20 runs, showing the sparsity of the feature selection. (c) The average image
of all men in the AR dataset projected to the selected features for each one of the 20
runs. (d) For a specific run: each row contains the images of one person projected onto
the selected feature space.

Fig. 3(a) shows the results averaged over 20 runs. The precision index is
normalized between 0 and 1, where 0 is obtained with random neighboring and 1
when all nearest neighbors are of the same class. Note that the precision index
of Q-α-SI is 0.64, which is significantly higher than the 0.39 obtained by the next
best method (SDR-SI). Fig. 3(b) shows the resulting α values sorted separately
at each one of the 20 runs. As can be seen, those values are extremely sparse,
with only a few of the feature weights above a very clear threshold at each run.
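A score of this kind can be sketched as follows. The exact definition is the one in [5,15] and is not reproduced in this section, so the normalization used here (chance-level purity mapped to 0, perfect purity mapped to 1), the choice of `ks`, and the function name `normalized_knn_precision` are all assumptions for illustration.

```python
import numpy as np

def normalized_knn_precision(features, labels, ks=(1, 3, 5)):
    """Average k-NN purity over several k, rescaled so chance is 0 and perfect is 1."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    n = len(y)
    # Pairwise squared Euclidean distances; a point is never its own neighbor.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    order = np.argsort(d2, axis=1)
    same = y[order] == y[:, None]  # same[i, j]: j-th neighbor of i shares i's label
    # Assumed chance level: probability that a random other point shares the label.
    counts = np.array([(y == c).sum() for c in y])
    chance = np.mean((counts - 1) / (n - 1))
    purity = np.mean([same[:, :k].mean() for k in ks])
    return (purity - chance) / (1.0 - chance)
```

Under this convention a representation whose neighborhoods are no purer than random scores about 0, and one in which every neighbor shares the query's identity scores exactly 1.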

Fig. 3(c) illustrates the features selected by Q-α-SI at each run. This is
done by synthesizing (reconstructing) the images from their wavelet coefficients
weighted by the α values. What is shown per run is the average male image.
Fig. 3(d) shows the projection of random faces from a specific run onto the weighted
feature space. Each row contains images of one person. In both figures (c,d) some
characteristic features of each individual (beard, dark glasses frame, distinctive
hair line) are highlighted, while the illumination differences are reduced.

Finally, it is worth noting that our attempts to find an appropriate kernel 
which will perform as well as the side data approach were unsuccessful. Our 
experiments show that the kernel Q-α has significant advantages over Q-α in
general, but selecting an appropriate kernel for the multiple structure paradigm 
is a hard problem and is left open (see [11] for work on kernel design). 

UC Irvine Repository Tests. We also applied our method to several
datasets from the UC Irvine repository. On each dataset we applied k-means
clustering on the raw data and on features provided by PCA, OPCA and CPCA.
An accuracy score was computed for each clustering result similarly to what was
done in [20]. The results are shown for the dermatology, segmentation, wine and
ecoli datasets. We also tested the algorithm on the glass, Boston-housing and
arrhythmia datasets, where none of the algorithms were significantly better than
chance. The results are summarized in the table below. Each reported result is an
average of several experiments where, in turn, each class served as side
information and the other classes were taken to be the main dataset. The features
were weighted, combined or selected according to the algorithm in question, and
then the data points were clustered by k-means. Each result shown in the table
was averaged over 20 runs. The number of features used for each PCA variant
was the one which gave the best average accuracy. The parameter λ used in
Q-α with side information was fixed at λ = 0.1.



Dataset        raw data   Q-α SI   PCA      CPCA     OPCA
dermatology    0.5197     0.8816   0.5197   0.6074   0.8050
ecoli          0.6889     0.7059   0.6953   0.6973   0.5620
segmentation   0.7157     0.7817   0.7208   0.7089   0.7110
wine           0.7280     0.9635   0.7280   0.7280   0.9493



Q-α-SI performed best over all the experiments we conducted.
In some of the datasets constrained PCA or oriented PCA performed only
slightly worse, but none of these methods gave good results consistently in all
four datasets. Unlike PCA and its variants, the Q-α algorithm tends to
produce a sparse selection of features, showing a large preference toward a small
number of features. For example, in the wine dataset the α values corresponding
to the features Alcohol and Proline were three times larger than the rest.

Abstract. We present a novel approach to modelling the non-linear and time-varying
dynamics of human motion, using statistical methods to capture the
characteristic motion patterns that exist in typical human activities. Our method is
based on automatically clustering the body pose space into connected regions ex- 
hibiting similar dynamical characteristics, modelling the dynamics in each region 
as a Gaussian autoregressive process. Activities that would require large numbers 
of exemplars in example based methods are covered by comparatively few motion 
models. Different regions correspond roughly to different action-fragments and 
our class inference scheme allows for smooth transitions between these, thus mak- 
ing it useful for activity recognition tasks. The method is used to track activities 
including walking, running, etc., using a planar 2D body model. Its effectiveness 
is demonstrated by its success in tracking complicated motions like turns, without 
any key frames or 3D information. 



1 Introduction 

Tracking and analyzing human motion in video sequences is a key requirement in several 
applications. There are two main levels of analysis: (i) detecting people and tracking their
image locations; and (ii) estimating their detailed body pose, e.g. for motion capture, 
action recognition or human-machine-interaction. The two levels interact, as accurate 
detection and tracking requires prior knowledge of pose and appearance, and pose esti- 
mation requires reliable tracking. Using an explicit body model allows the state of the 
tracker to be represented as a vector of interpretable pose parameters, but the problem 
is non-trivial owing to the great flexibility of the human body, which requires the mod- 
elling of many degrees of freedom, and the frequent non-observability of many of these 
degrees of freedom in monocular sequences owing to self-occlusions and depth ambi- 
guities. In fact, if full 3D pose is required from monocular images, there are potentially 
thousands of local minima owing to kinematic flipping ambiguities [18]. Even without 
this, pervasive image ambiguities, shadows and loose clothing add to the difficulties. 

Previous work: Human body motion work divides roughly into tracking based ap- 
proaches, which involve propagating the pose estimate from one time step to another, 
and detection based approaches, which estimate pose from the current image(s) alone. 
The latter have become popular recently in the form of ‘exemplars’ [21] and ‘key frames’ 
[19]. These methods allow the direct use of image data, which eliminates the need for
predefined parametric models. But the interpretability of parametric models is lost, and
large numbers of exemplars are needed to cover high dimensional spaces such as those
of human poses. (Tree-based structures have recently been explored for organizing these
datasets [20], but they rely on the existence of accurate distance metrics in the appearance
space.)

Fig. 1. Overview of the learning and tracking components of our algorithm (see text).

Within the tracking framework, many methods are based on computing optical flow
[9,3,2], while others optimize over static images (e.g. [18]). On the representation side,
a variety of 2D and 3D parametric models have been used [9,3,16,18], as well as
non-parametric representations based on motion [4] or appearance [15,11,21]. A few learning
based methods have modelled dynamics [8,17,14], motion patterns from motion capture
data (e.g. [1]), and image features [16,7,6]. To track body pose, Howe et al. [8] and
Sidenbladh et al. [17] propose plausible next states by recovering similar training examples,
while Pavlovic et al. [14] learn a weak dynamical model over a simplified 8-parameter
body for fronto-parallel motions. We extend the learning based approach by modelling 
complex high dimensional motions within reduced manifolds in an unsupervised setting. 
In the past, nonlinear motion models have been created by combining Hidden Markov 
Models and Linear Dynamical Systems in the multi-class dynamics framework, e.g. 
in [13,14]. However, this approach artificially decouples the switching dynamics from 
the continuous dynamics. We propose a simpler alternative that avoids this decoupling, 
discussing our philosophy in section 3.4. 

Problem formulation: We use a tracking based approach, representing human motions 
in terms of a fixed parametric body model controlled by pose-related parameters, and fo- 
cusing on flexible methods for learning the human dynamics. We specialize to monocular 
sequences using a 2D (image based) body model, but our methods extend immediately 
to the 3D and multicamera cases. Our main aim is to study how relationships and
constraints in parameter space can be learned automatically from sample trajectories, and
how this information can be exploited for tracking. Issues to be handled include the
‘curse of dimensionality’, complex nonlinear motions, and transitions between different
parts of the space.

Fig. 2. (a) Human pose parametrization in the Scaled Prismatic Model. (b) Examples of different
poses of the complete SPM. Each limb segment is overlaid with its corresponding template
shape.

Overview of approach: Our approach is based on learning dynamical models from
sample trajectories. We learn a collection of local motion models (Gaussian
autoregressive processes) by automatically partitioning the parameter space into regions with
similar dynamical characteristics. The mixture of dynamical models is built from a set
of hand-labelled training sequences as follows: (i) the state vectors are clustered using
K-means and projected to a lower dimensional space using PCA to stabilize the
subsequent estimation process; (ii) a local linear autoregression for the state given the p
previous reduced states is learned for each cluster (p = 1,2 in practice); (iii) the data is
reclustered using a criterion that takes into account the accuracy of the local model for
the given point, as well as the spatial contiguity of points in each model; (iv) the models
are refitted to the new clusters, and the process is iterated to convergence.

We sidestep the difficult depth estimation problem by using a purely 2D approach, 
so our dynamical models are view dependent. Our tracking framework is similar to 
Covariance Scaled Sampling [18]: well-shaped random sampling followed by local op- 
timization of image likelihood. Figure 1 illustrates the basic scheme of dividing the 
problem into learning and tracking stages. 



2 Body Representation 

We choose a simple representation for the human body: a modified Scaled Prismatic 
Model [12] that encodes the body as a set of 2D chains of articulated limb segments. This 
avoids 3D ambiguities while still capturing the natural degrees of freedom. Body parts 
are represented by rounded trapezoidal image templates defined by their end widths, 
and body poses are parametrized by their joint angles and apparent (projected) limb 
lengths. Including limb lengths, joint angles and hip and shoulder positions, our model
contains 33 parameters, giving 33-D state vectors $x = (\theta_1, d_1, \theta_2, d_2, \ldots, \theta_n, d_n)$. Figure
2 illustrates the parametrization and shows some sample poses.

Three additional parameters are used during tracking, two for the image location of 
the body centre and one for overall scale. We learn scale and translation independently 
of limb movements, so these parameters are not part of the learned body model. The 
template for each body part contains texture information used for model-image matching. 
Its width parameters depend on the subject’s clothing and physique. They are defined 
during initialization and afterwards remain fixed relative to the overall body scale, which 
is actively tracked. 

3 Dynamical Model Formulation 

Human motion is both complex and time-varying. It is not tractable to build an exact
analytical model for it, but approximate models based on statistical methods are a po- 
tential substitute. Such models involve learning characteristic motions from example 
trajectories in parameter space. Our model learns the nonlinear dynamics by partition- 
ing the parameter space into distinct regions or motion classes, and learning a linear 
autoregressive process covering each region. 

3.1 Partitioning of State Space 

In cases where the dynamics of a time series changes with time, a single model is often 
inadequate to describe the evolution in state space. To get around this, we partition the 
state space into regions containing separate models that describe distinct motion patterns. 
The partitions must satisfy two main criteria: (i) different motion patterns must belong 
to different regions; and (ii) regions should be contiguous in state space. I.e., we need
to break the state space into contiguous regions with coherent dynamics. Coherency
means that the chosen dynamical model is locally accurate, contiguity that it can be
reliably deduced from the current state space position. Different walking or running
styles, viewpoints, etc., tend to use separate regions of state space and hence separate
sets of partitions, allowing us to infer pose or action from class information. 

We perform an initial partitioning on unstructured input points in state space by using 
K-means on Mahalanobis distances (see fig. 3). The clusters are found to cut the state 
trajectories into short sections, all sections in a given partition having similar dynamics. 
The partitions are then refined to improve the accuracies of the nearby dynamical models. 
The local model estimation and dynamics based partition refinement are iterated in an 
EM-like loop, details of which are given in section 3.3. 

3.2 Modelling the Local Dynamics 

Despite the complexity of human dynamics and the use of unphysical image-based
models, we find that the local dynamics within each region is usually well described by
a linear Auto-Regressive Process (ARP):

$$x_t = \sum_{i=1}^{p} A_i \, x_{t-i} + w_t + v_t \qquad (1)$$

Here, $x_t \in \mathbb{R}^m$ is the pose at time $t$ (joint angles and link lengths), $p$ is the model order
(number of previous states used), $A_i$ are $m \times m$ matrices giving the influence of $x_{t-i}$
on $x_t$, $w_t \in \mathbb{R}^m$ is a drift/offset term, and $v_t$ is a random noise vector (here assumed
white and Gaussian, $v_t \sim \mathcal{N}(0, Q)$).

Fig. 3. (a) The initial partition of the state space of a walking motion (5 cycles), projected to 2-D
using PCA (see text). (b) The clusters correspond to different phases of the walking cycle, here
illustrated using the variations of individual joint angles with time. (The cluster labels are coded
by colour.) These figures illustrate the optimal clustering obtained for a $p = 1$ ARP. For $p = 2$, a single
class suffices for modelling unidirectional walking dynamics.

The choice of ARP order is strongly dependent on the nature of the motions exhibited 
by the system. In practice, experiments on different kinds of motion showed that a second 
order ARP usually suffices for human tracking: 



$$x_t = A_1 x_{t-1} + A_2 x_{t-2} + v_t \qquad (2)$$

This models the local motion as a mass-spring system (set of coupled damped harmonic
oscillators). It can also be written in differential form: $\ddot{x}_t = B_1 \dot{x}_t + B_2 x_t + v_t$.
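To make the mass-spring reading concrete, here is a minimal scalar sketch of a second order ARP. It uses the standard fact that the recursion $x_t = a_1 x_{t-1} + a_2 x_{t-2}$ with $a_1 = 2r\cos\omega$ and $a_2 = -r^2$ has the solution $x_t = r^t \cos(\omega t)$ for suitable initial values, i.e. a damped oscillation when $r < 1$. The values of `r` and `omega` and the helper name `simulate_ar2` are illustrative assumptions, not parameters taken from the experiments.

```python
import numpy as np

def simulate_ar2(a1, a2, x0, x1, steps, noise_std=0.0, seed=0):
    """Simulate the scalar recursion x_t = a1*x_{t-1} + a2*x_{t-2} + v_t."""
    rng = np.random.default_rng(seed)
    x = np.empty(steps)
    x[0], x[1] = x0, x1
    for t in range(2, steps):
        x[t] = a1 * x[t - 1] + a2 * x[t - 2] + noise_std * rng.standard_normal()
    return x

# Illustrative values: decay factor r per step and a period of 20 steps.
r, omega = 0.95, 2 * np.pi / 20
a1, a2 = 2 * r * np.cos(omega), -r ** 2

# Noise-free trajectory started on the closed-form solution r**t * cos(omega*t).
traj = simulate_ar2(a1, a2, x0=1.0, x1=r * np.cos(omega), steps=60)
```

In the vector case the matrices $A_1$ and $A_2$ play the role of `a1` and `a2`, and coupling between coordinates gives a set of interacting oscillators rather than a single one.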



3.3 Model Parameter Estimation 

The parameters to be estimated are the state-space partitioning, here encoded by the
class centers $c_k$, and the ARP parameters $\{A_1^k, \ldots, A_p^k, Q^k\}$ within each class ($k = 1 \ldots K$).
There are standard ways of learning ARP models from training data [10]. We
compute maximum likelihood parameter estimates. We also want to take advantage of
the well-structured nature of human motion. People rarely move their limbs completely
independently of one another, although the actual degree of correlation depends on the 
activity being performed. This can be exploited by learning the dynamics with respect to 
a reduced set of degrees of freedom within each class, i.e. locally projecting the system 
trajectories into a lower dimensional subspace. Thus, within each partition, we: 

1. reduce the dimensionality using linear PCA (in practice to about 5);

2. learn an ARP model in the reduced space;

3. “lift” this model to the full state space using the PCA injection;

4. cross-validate the resulting model to choose the PCA dimension.

[Figure 4 diagram; box labels: reduced state space, reduced dynamical model, reduced state prediction, PCA-based lifting, lifted PCA subspace, velocity based lifting, full state prediction]

Fig. 4. Using a reduced dynamical model to predict states in a high-dimensional space. A given
state is projected onto a low-dimensional space using PCA, within which a linear autoregressive
process is used to predict a current (reduced) state. This is then lifted back into full state space to
estimate a noise model in the high-dimensional space. To prevent the state from being continually
squashed into the PCA subspace, we lift the velocity prediction and not the state prediction.
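The difference between lifting the state and lifting the velocity can be sketched as below. This is one plausible reading of the caption rather than the authors' exact procedure: the reduced prediction is turned into a displacement in the PCA subspace and added to the previous full state, so components outside the subspace are carried over instead of being discarded. The function names and the second order form are assumptions for the example.

```python
import numpy as np

def predict_state_lift(x_prev, x_prev2, c, U, A1, A2):
    """Lift the predicted reduced state: the result always lies in c + span(U)."""
    q1 = U.T @ (x_prev - c)
    q2 = U.T @ (x_prev2 - c)
    q_hat = A1 @ q1 + A2 @ q2
    return c + U @ q_hat

def predict_velocity_lift(x_prev, x_prev2, c, U, A1, A2):
    """Lift only the predicted change, keeping off-subspace components of x_prev."""
    q1 = U.T @ (x_prev - c)
    q2 = U.T @ (x_prev2 - c)
    q_hat = A1 @ q1 + A2 @ q2
    return x_prev + U @ (q_hat - q1)
```

With state lifting, any part of the pose orthogonal to the subspace is reset to the class mean at every step; with velocity lifting it is simply left unchanged.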



The basic scheme is illustrated in figure 4, and the complete algorithm is given below. 
Before applying PCA, the state-space dimensions need to be statistically normalized.
This is done by dividing each dimension by its observed variance over the complete set 
of training data. 

Algorithm for estimation of maximum-likelihood parameters 

1. Initialize the state-space partitions by K-means clustering based on scaled (diagonal
Mahalanobis) distance. 

2. Learn an autoregressive model within each partition. 

3. Re-partition the input points to minimize the dynamical model prediction error. If 
the class assignments have converged, stop. Otherwise go to step 2. 
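Step 1 of the algorithm above can be sketched with a small Lloyd-style K-means on rescaled coordinates. Two assumptions are made and should be read as such: each dimension is divided by its standard deviation, so that Euclidean distance in the scaled space equals a diagonal Mahalanobis distance (the text speaks of dividing by the observed variance), and the random initialization and the name `kmeans_scaled` are illustrative choices.

```python
import numpy as np

def kmeans_scaled(X, K, iters=50, seed=0):
    """K-means on per-dimension rescaled data; returns labels and centres in original units."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    scale = X.std(axis=0)          # assumption: standard deviation, not variance
    scale[scale == 0] = 1.0        # leave constant dimensions untouched
    Z = X / scale
    centres = Z[rng.choice(len(Z), size=K, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest centre in the scaled space.
        d2 = ((Z[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Recompute centres; an empty cluster keeps its previous centre.
        new = np.array([Z[labels == k].mean(axis=0) if np.any(labels == k) else centres[k]
                        for k in range(K)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels, centres * scale
```

The returned labels would then seed step 2, where one autoregressive model is fitted per partition.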

Step 2 above is performed as follows:

1. Reduce the vectors in the class to a lower dimensional space by:

a) Centering them and assembling them into a matrix (by columns):
$X = [\,(x_{p_1} - c)\ (x_{p_2} - c)\ \cdots\ (x_{p_m} - c)\,]$, where $p_1 \ldots p_m$ are the indices of the
points in the class and $c$ is the class mean.

b) Performing a Singular Value Decomposition of the matrix to project out the
dominant directions: $X = U D V^T$.

c) Projecting each vector into the dominant subspace: each $x_i \in \mathbb{R}^m$ is represented
as a reduced vector $q_i = \tilde{U}^T (x_i - c)$ in $\mathbb{R}^{m'}$ ($m' < m$), where $\tilde{U}$ is the matrix
consisting of the first $m'$ columns of $U$.

2. Build an autoregressive model, $q_t = \sum_{i=1}^{p} A_i \, q_{t-i}$, and estimate $A_i$ by writing this
in the form of a linear regression:

$$q_t = A \, \breve{q}_{t-1}, \qquad t = t_{p_1}, t_{p_2}, \ldots, t_{p_n} \qquad (3)$$

where

$$A = (\, A_1 \ A_2 \ \cdots \ A_p \,), \qquad \breve{q}_{t-1} = \begin{pmatrix} q_{t-1} \\ q_{t-2} \\ \vdots \\ q_{t-p} \end{pmatrix}$$

3. Estimate the error covariance $Q$ from the residual between $\{\hat{x}_i\}$ and $\{x_i\}$ by “lifting”
$\hat{q}_t$ back into $m$ dimensions:

$$\hat{x}_t = c + \tilde{U} \hat{q}_t \qquad (4)$$
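The three sub-steps can be sketched in NumPy as follows. This is a simplified illustration, not the authors' code: it assumes that the points of a class form one contiguous, time-ordered trajectory (in general a class collects several short trajectory sections), it fits the regression (3) by ordinary least squares, and the function name `fit_reduced_arp` is hypothetical.

```python
import numpy as np

def fit_reduced_arp(X, dim, order=2):
    """Fit a reduced ARP to states X of shape (T, m), rows in temporal order."""
    X = np.asarray(X, dtype=float)
    c = X.mean(axis=0)
    # 1. PCA via SVD of the centred data, points stored as columns.
    U, _, _ = np.linalg.svd((X - c).T, full_matrices=False)
    U = U[:, :dim]                      # first `dim` columns span the subspace
    q = (X - c) @ U                     # row t holds U^T (x_t - c)
    # 2. Linear regression q_t = A [q_{t-1}; ...; q_{t-p}], as in (3).
    T = len(q)
    targets = q[order:]
    regress = np.hstack([q[order - i:T - i] for i in range(1, order + 1)])
    W, *_ = np.linalg.lstsq(regress, targets, rcond=None)
    A = W.T                             # shape (dim, order*dim) = (A_1 ... A_p)
    # 3. Lift the reduced predictions as in (4) and estimate the covariance Q.
    x_hat = c + (regress @ A.T) @ U.T
    resid = X[order:] - x_hat
    Q = resid.T @ resid / len(resid)
    return c, U, A, Q
```

Because the residual is measured in the full $m$-dimensional space, $Q$ also absorbs whatever variation the PCA subspace fails to capture, which is what makes the lifted model usable as a noise model for sampling.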

Step 3 above is performed as follows: The K-means based partitions are revised by
assigning training points to the dynamical model that predicts their true motion best,
and the dynamical models are then re-learned over their new training points. This EM /
relaxation procedure is iterated to convergence. In practice, using dynamical prediction
error as the sole fitting criterion gives erratic results, as models sometimes “capture”
quite distant points. So we include a spatial smoothing term by minimizing:

$$\sum_{\text{training points}} (\text{prediction error}) + \lambda \cdot (\text{number of inter-class neighbors})$$

where $\lambda$ is a relative weighting term, and the number of inter-class neighbors is the
number of edges in a neighborhood graph that have their two vertices in different classes
(i.e., a measure of the lack of contiguity of a partition).
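The criterion can be written down directly once the per-model prediction errors and a neighborhood graph are available. In the sketch below, `pred_err[i, k]` is assumed to hold the prediction error of model `k` at training point `i`, and `edges` lists the index pairs of the neighborhood graph; both are hypothetical inputs. The text does not say how the minimization is carried out, so the greedy point-by-point relabelling shown here is only one simple possibility.

```python
import numpy as np

def partition_cost(pred_err, labels, edges, lam=0.1):
    """Sum of prediction errors plus lam times the number of inter-class edges."""
    labels = np.asarray(labels)
    edges = np.asarray(edges)
    data_term = pred_err[np.arange(len(labels)), labels].sum()
    cut_edges = np.count_nonzero(labels[edges[:, 0]] != labels[edges[:, 1]])
    return data_term + lam * cut_edges

def greedy_repartition(pred_err, labels, edges, lam=0.1, sweeps=10):
    """Assumed optimizer: relabel one point at a time while the cost strictly drops."""
    labels = np.array(labels)
    n, K = pred_err.shape
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(sweeps):
        changed = False
        for i in range(n):
            nb = labels[np.asarray(nbrs[i], dtype=int)]
            # Local cost of giving point i each label, with its neighbors held fixed.
            local = [pred_err[i, k] + lam * np.count_nonzero(nb != k) for k in range(K)]
            best = int(np.argmin(local))
            if local[best] < local[labels[i]]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels
```

Setting `lam` to zero recovers the pure prediction-error assignment, while a very large `lam` freezes the partition near its initial state.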

3.4 Inter-class Transitions 

Many example-based trackers use discrete state HMMs (transition probability matrices) 
to model inter-cluster transitions [21,20]. This is unavoidable when there is no state 
space model at all (e.g. in exemplars [21]), and it is also effective when modelling time
series that are known to be well approximated by a set of piecewise linear regimes [5]. 
Its use has been extended to multi-class linear dynamical systems exhibiting continuous 
behavior [14], but we believe that this is unwise, as the discrete transitions ignore the 
location-within-partition information encoded by the continuous state, which strongly 
influences inter-class transition probabilities. To work around this, quite small regions 
have to be used, which breaks up the natural structure of the dynamics and greatly 
inflates the number of parameters to be learned. In fact, in modelling human motion, 
the current continuous state already contains a great deal of information about the likely 
future evolution, and we have found that this alone is rich enough to characterize human 
motion classes, without the need for the separate hidden discrete state labels of HMM 
based models. 

We thus prefer the simpler approach of using a mixture of linear dynamical models
over an explicit spatial partition, where the ‘class’ label is just the current partition cell.
More precisely, we use soft partition assignments obtained from a Gaussian mixture
model based at the class centres, so the dynamics for each point is a weighted random
mixture over the models of nearby partitions. Our classes cover relatively large regions of
state space, but transitions typically only occur at certain (boundary) areas within them.
Constant transition probabilities given the current class label would thus be inappropriate
in our case.

Tracking Articulated Motion Using a Mixture of Autoregressive Models

Fig. 5. Graphical models for inter-class transitions of a system. (a) An HMM-like mixed-state
model, and (b) our inter-class transition model ($z_t$: observation, $x_t$: continuous state, $k_t$: discrete
class). Transitions in an HMM are learned as a fixed transition probability matrix, while our model
allows location-sensitive estimation of the class label by exploiting continuous state information.

Figure 5 compares the two schemes in graphical form. By modelling the class-label to 
be conditional on continuous state, we ensure a smooth flow from one model to the next, 
avoiding erratic jumps between classes, and we obviate the need for complex inference 
over a hidden class-label variable. 



4 Image Matching Likelihood 

At present, for the model-image matching likelihood we simply use the weighted
sum-of-squares error of the backwards-warped image against body-part reference templates
fixed during initialization. Occlusions are handled using support maps. Each body part
$P$ has an associated support map whose $j$th entry gives the probability that image pixel
$j$ currently ‘sees’ this part. Currently, we use hard assignments, $p(j \text{ sees } P) \in \{0, 1\}$.
To resolve the visibility ambiguity when two limbs overlap spatially, each pose has an
associated limb-ordering, which is known a priori for different regions in the pose space
from the training data. This information is used to identify occluded pixels that do not 
contribute to the image matching likelihood for the pose. We charge a fixed penalty for 
each such pixel, equal to the mean per-pixel error of the visible points in that segment. 
Some sample support maps are shown in figure 8(b). 
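For a single body part, the matching error with a hard support map can be sketched as below. The array names, the optional per-pixel `weights`, and the convention that a fully occluded part contributes zero error are assumptions made for the example; only the visible-pixel sum of squares and the mean-error penalty per occluded pixel come from the description above.

```python
import numpy as np

def part_matching_error(warped, template, visible, weights=None):
    """Weighted SSD over visible pixels plus a mean-error penalty per occluded pixel."""
    warped = np.asarray(warped, dtype=float)
    template = np.asarray(template, dtype=float)
    visible = np.asarray(visible, dtype=bool)   # hard support map: True = pixel sees the part
    if weights is None:
        weights = np.ones_like(template)
    sq = weights * (warped - template) ** 2
    n_vis = int(visible.sum())
    if n_vis == 0:
        return 0.0                               # assumption: nothing visible, nothing charged
    vis_err = sq[visible].sum()
    penalty = vis_err / n_vis                    # mean per-pixel error of the visible points
    n_occ = visible.size - n_vis
    return vis_err + penalty * n_occ
```

Charging the mean visible error for each occluded pixel keeps the total comparable across poses with different amounts of occlusion, so that hiding a limb is neither rewarded nor unduly punished.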

5 Tracking Framework 

Fig. 6. Results from tracking athletic motion (frames 0, 4, 8, 12, 16, 20, 24). The tracker was trained
on a different athlete performing a similar motion. Strong priors from the dynamical model allow
individual limbs to be tracked in the presence of a confusing background. Note that the left arm is
not tracked accurately. This is due to the fact that it was occluded in the initial image and hence no
information about its appearance was captured in the template. However, the dynamics continue
to give a good estimate of its position.

Our tracking framework is similar to Covariance Scaled Sampling [18]. For each mode of
$x_{t-1}$, the distribution $\mathcal{N}(x_t, Q)$ estimated by the dynamical model (1,5) is sampled, and
the image likelihood is locally optimized at each mode. State probabilities are propagated
over time using Bayes’ rule. The probability of the tracker being in state (pose) $x_t$ at
time $t$ given the sequence of observations $Z_t = \{z_t, z_{t-1}, \ldots, z_0\}$ is:

$$p(x_t \mid Z_t) \propto p(z_t \mid x_t)\, p(x_t \mid Z_{t-1})$$

where $X_t$ is the sequence of poses $\{x_i\}$ up to time $t$ and

$$p(x_t \mid Z_{t-1}) = \int p(x_t \mid X_{t-1})\, p(X_{t-1} \mid Z_{t-1})\, dX_{t-1} \qquad (5)$$

The likelihood $p(z_t \mid x_t)$ of observing image $z_t$ given model pose $x_t$ is computed based
on the image-model matching error. The temporal prior $p(x_t \mid X_{t-1})$ is computed from
the learned dynamics. In our model, the choice of discrete class label $k_t$ is determined
by the current region in state space, which in our current implementation depends only
on the previous pose $x_{t-1}$, enabling us to express the probability as

$$p(x_t \mid X_{t-1}) = p(x_t \mid X_{t-1}, k_t)\, p(k_t \mid x_{t-1}) \qquad (6)$$

The size and contiguity of our dynamical regions implies that $p(k_t \mid x_{t-1})$ is usually
highly unimodal. The number of modes increases when the state lies close to the boundary
between two or more regions, but in this case, the spatial coherence inherited from the
training dynamics usually ensures that any of the corresponding models can be used
successfully, so the number of distinct modes being tracked does not tend to increase
exponentially with time. For each model $k = 1 \ldots K$, we use a Gaussian posterior for
$p(k \mid x_t)$: $p(k \mid x_t) \propto e^{-((x_t - c_k)^T (x_t - c_k))/2}$, where $c_k$ is the center of the $k$th class.

Note that with a second order ARP model, $p(x_t \mid X_{t-1}) = p(x_t \mid x_{t-1}, x_{t-2})$.
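Equation (6) together with the Gaussian class posterior suggests a simple sampling scheme for the temporal prior, sketched below. It assumes unit isotropic covariance around each class centre (reasonable only for normalized state coordinates), second order models without a drift term, and a hypothetical `models` list holding one `(A1, A2, Q)` triple per class in the full state space.

```python
import numpy as np

def class_posterior(x, centres):
    """p(k | x) proportional to exp(-||x - c_k||^2 / 2), computed stably."""
    d2 = ((np.asarray(centres, dtype=float) - x) ** 2).sum(axis=1)
    logw = -0.5 * d2
    logw -= logw.max()                 # avoid underflow before exponentiating
    w = np.exp(logw)
    return w / w.sum()

def sample_next_state(x_prev, x_prev2, centres, models, rng):
    """Draw k_t from p(k | x_{t-1}), then x_t from that class's second order ARP."""
    p = class_posterior(x_prev, centres)
    k = int(rng.choice(len(p), p=p))
    A1, A2, Q = models[k]              # hypothetical per-class parameters
    mean = A1 @ x_prev + A2 @ x_prev2
    return k, rng.multivariate_normal(mean, Q)
```

Far from any boundary the posterior is essentially one-hot and a single model drives the prediction; near a boundary several classes receive weight, which is what allows smooth hand-over between motion models.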

6 Results 

We demonstrate our technique by learning models for different classes of human motion 
and using them to track complete body movements in unseen video sequences. Here, 
we present results from two challenging sequences. 

1. Fast athletic motion: This is a case where traditional methods typically fail due
to high motion blur. A hand-labelled sequence covering a few running cycles is used to
train a model and this is used to track a different person performing a similar motion.
For a given viewing direction, we find that a single 2nd order autoregressive process
in 5 dimensions suffices to capture the dynamics of such running motions. A tracking
example is shown in figure 6.

Fig. 7. (a) Dynamical model prediction error w.r.t. number of motion-classes in the turning
experiment. Minimizing the validation error selected 3 classes, corresponding to the two walking
directions and turning between them. (b) The influence of spatial regularization when re-partitioning
the state space. A weak regularization $\lambda \approx 0.1$ gives the optimal dynamical estimates. A larger $\lambda$
causes the partition to remain too close to the suboptimal initial K-means estimate.

2. Switching between turning and walking: This experiment illustrates the effec- 
tiveness of our inter-class transition model. A 300-frame sequence consisting of walking 
in different directions and turning motion is used as training data. Our learning algorithm 
correctly identifies 3 motion patterns (see figure 7(a)), corresponding to two different 
walking directions and turning between them. The frames corresponding to the centers 
of these 3 classes are shown in figure 8(a). While tracking a new sequence, the model 
correctly shifts between different classes enabling smooth switching between activities. 
Figure 8(c) shows complete tracking results on an unseen test sequence. 

In both cases, the models were initialized manually (we are currently working on 
automatic initialization), after which only the learned dynamics and appearance infor- 
mation were used for tracking. Position and scale changes were modelled respectively 
as first and zeroth order random walks and learned online during tracking. This allows 
us to track sequences without assuming either static or fixating cameras, as is done in 
several other works. The dynamical model alone gives fairly accurate pose predictions 
for at least a few frames, but the absence of clear observations for any longer than this 
may cause mistracking. 

Figure 7(b) shows how repartitioning (step 3 of our parameter estimation algorithm) 
improves on the initial K-means based model, provided that a weak smoothing term is 
included. 



7 Conclusion 

We have discussed a novel approach to modelling dynamics of high degree-of-freedom 
systems such as the human body. Our approach is a step towards describing dynamical 
behavior of high-dimensional parametric model spaces without having to store extremely
large amounts of training data. It takes advantage of local correlations between motion
parameters by partitioning the space into contiguous regions and learning individual local
dynamical behavior within reduced dimensional manifolds. The approach was tested on
several different human motion sequences with good results, and allows the tracking
of complex unseen motions in the presence of image ambiguities. The mixture-based
learning scheme developed here is practically effective, and scalable in the sense that it
allows models for different actions to be built independently and then stitched together
to cover the complete ‘activity space’. The learning process can also be made interactive
to allow annotation of different classes for activity recognition purposes.

Fig. 8. Examples from our turning experiment. (a) Poses characterizing the 3 motion classes
learned. (b) Support maps illustrating occlusion information for the 3 classes (color coded by
body part). (c) Tracking results (every 6th frame from 0-66). The corresponding state vectors
show a smooth transition between the turning and walking models.

In terms of future work, the appearance model needs to be improved. Adding detec- 
tors for characteristic human features and allowing the appearance to evolve with time 
would help to make the tracker more robust and more general. Including a wider range 
of training data would allow the tracker to cover more types of human motions. 

An open question is whether non-parametric models could usefully be incorporated 
to aid tracking. Joint angles are a useful output, and are probably also the most appropriate 
representation for dynamical modelling. But it might be more robust to use comparison 
with real images, rather than comparison with an idealized model, to compute likelihoods 
for joint-based pose tracking. 
Abstract. Volumetric structures are frequently used as shape descriptors 
for 3D data. The capture of such data is being facilitated by developments 
in multi-view video and range scanning, extending to subjects 
that are alive and moving. In this paper, we examine vision-based model- 
ing and the related representation of moving articulated creatures using 
spines. We define a spine as a branching axial structure representing 
the shape and topology of a 3D object’s limbs, and capturing the limbs’ 
correspondence and motion over time. 

Our spine concept builds on skeletal representations often used to de- 
scribe the internal structure of an articulated object and the significant 
protrusions. The algorithms for determining both 2D and 3D skeletons 
generally use an objective function tuned to balance stability against the 
responsiveness to detail. Our representation of a spine provides for en- 
hancements over a 3D skeleton, afforded by temporal robustness and cor- 
respondence. We also introduce a probabilistic framework that is needed 
to compute the spine from a sequence of surface data. 

We present a practical implementation that approximates the spine’s 
joint probability function to reconstruct spines for synthetic and real 
subjects that move. 



1 Introduction 

We are interested in the detection and tracking of features in volumetric images. 
Volume images capture shape as a temporal sequence of boundary voxels or other 
forms of 3D surfaces. Specifically, we wish to address situations where the subject 
is known to have and is exercising an articulated structure. This assumption 
grants us use of a specific class of geometric modeling solutions. The various 
methods for skeletonizing 2D and 3D images share the objectives of identifying 
extrema, features with some geometric significance, and capturing the spatial 
relationships between them [9]. Skeletons, much like generalized cylinders [4,21], 
serve the purpose of abstracting from raw volume or surface data to get higher 
level structural information. 

We propose that evaluating volumetric data of a subject over time can dis- 
ambiguate real limbs from noisy protrusions. In a single image, knowledge of the 
specific application alone would dictate the noise threshold to keep or cull small 
branches of the skeleton. Many such algorithms exist. In the case of articulated 
Fig. 1. (A) Articulated subject, (B) reconstructed surface, (C) extracted skeleton, (D) 
spine graph limbs encoding motion over time; nodes labeled for illustration only. 



moving subjects, the volumetric images change but the underlying structure 
stays the same. We hypothesize that the parts of the skeleton within each image 
that are consistent over time more reliably capture the subject’s structure. To 
this end, we introduce our notion of spines. 

As defined in [4], a generalized cylinder is a surface obtained by sweeping a 
planar cross section along an axis, or space curve. To represent a body made of 
multiple generalized cylinders, we need to merge axes of the different limbs into 
one branching axial structure. The branching structure can be represented by a 
graph, G (Limb Boundaries, Limbs), where edges are limbs, leaf nodes are end 
effectors, and the remaining nodes (all of degree > 2) are limb junctions (see 
Figure 1D). So far, we have described the general formulation of a skeleton [5]. 
To parameterize the motion of a skeleton, we express the new spine graph as a 
function over time: 

Spine_t = F(G, t).    (1) 

For a given time t, the limbs of G will be in a specific pose, captured by 
F’s mapping of G’s topology to axial curves in 3D - a single skeleton. When 
estimating a data set’s spine in the subsequent sections, we will constrain F to 
manipulate the limbs of a G that represents a series of topologically consistent 
skeletons. These skeletons are determined as probable given the input data. 

The implementation of our algorithm is a modular pipeline. It first reduces 
the complexity of multi-view video data to voxels, further to polygons, and finally 
to spines. The resulting model captures the original degrees of freedom needed 
to play back the subject’s motions (see Figure 1). 

2 Related and Motivating Work 

The 2D analogue to our problem is the tracking of correspondence in medial 
axes, which were first introduced by Blum [5]. Given any of the numerous 2D 
skeletonizing techniques, including the classic grassfire models based on distance 
and the more robust area-based techniques [3], the work of Sebastian et al. [23] 
can determine correspondence by minimizing edit-distances of skeleton graphs 
in 2D. 

The medial axes of 3D surfaces are not directly applicable because they 
generate 2D manifold “sheets” through a surface. While medial scaffolds can be 
calculated fairly robustly [24,19], they require further processing [28] to estimate 
good 1D axes. 
Several 3D skeletonization algorithms have been developed using 3D Voronoi 
cells to partition the space within a mesh [2,13,25,12,16]. The cell-walls of these 
convex polyhedra land at equal distances from their designated surface start- 
points - some at or near the medial axis. This approach, with various extensions 
of projection and pruning, can generally serve to synthesize axes. In contrast 
to these, our approach and implementation are based on two sub-domains of 
solutions: measuring of geodesic distance from geometric modeling, and principal 
curves from statistics. 

Geodesic Distance: In Section 4.1 we will discuss in greater detail how a 
surface can be treated as a piecewise continuous distance field that separates 
features from each other. Verroust and Lazarus [27] used such a technique to 
determine axes of symmetry within limbs, and how to connect them to critical 
points (special topological features) on the mesh surface. In an application not 
requiring branching axes, Nain et al. [22] used geodesic distances on colon models 
to determine center-lines for virtual colonoscopy navigation. Recently, a geodesic 
distance based metric was used by Katz and Tal [17] to help assign patches as 
members of explicit limbs, resulting in coarse animation control-skeletons. All 
these approaches benefit from works such as [15] which identify extrema, or fea- 
tures that protrude from or into a surface mesh. Our approach uses such extrema- 
finding and a geodesic distance metric to better model skeleton branching. 

Principal Curves: Hastie and Stuetzle [14] defined principal curves as pass- 
ing through the middle of a multidimensional data set, as a representation of 
self-consistency to generalize principal components. For fixed length curves in a 
geometric setting, Kegl et al. [18] showed how to minimize the squared distance 
between the curve and points sampled randomly from the encompassing shape. 
Most recently, [7] and [8] extended this notion of principal curves to 3D, for- 
malizing the problem as an optimization which also seeks to minimize the curve 
length. Our extension is to incorporate branching and temporal correspondence. 

3 Spine Formulation and Estimation 

We build on the axial representation of generalized cylinders of [8,7] because 
of their elegant mathematical formulation. They treat the regression problem 
of finding a single curve for a surface as the minimization of a global energy 
function. Much like the previous work on principal curves [14,18], they seek to 
minimize the total distance from the axial curve to the surface. But in addition, 
[7] incorporates a term which penalizes the curve’s length. This augmentation 
helps force the shorter curve to smoothly follow the middle of a surface, instead 
of, for example, spiraling through all the boundary points. 

For our spine formulation, we seek to further incorporate: (a) skeletons S 
that model branching curves of individual surfaces X and (b) data captured 
over a period of time T. We propose a discriminative probabilistic approach to 
computing spines by finding G, S, and limb end effectors E, which maximize: 

P(G, S_{1:T}, E_{1:T} \mid X_{1:T}) = P(G \mid S_{1:T}, E_{1:T}, X_{1:T}) \cdot P(S_{1:T}, E_{1:T} \mid X_{1:T})    (2) 
To compute and optimize the joint probability P(S_{1:T}, E_{1:T} | X_{1:T}) requires 
searching over all skeletons over all time simultaneously. In order to make the 
solution more computationally tractable, we make the assumption that S_t and 
E_t are independent of S_{t'} and E_{t'} for all t' ≠ t, given X_t: 

P(G, S_{1:T}, E_{1:T} \mid X_{1:T}) \approx P(G \mid S_{1:T}, E_{1:T}, X_{1:T}) \cdot \prod_{t=1}^{T} P(S_t, E_t \mid X_t)    (3) 

This assumption can lead to temporal inconsistencies that can be resolved 
once G is estimated (as shown in Section 4.2). We use a bottom-up approach 
that approximates each S_t and E_t individually, and then estimates 
G. Ideally, we would like to estimate G, S, and E using an EM-like algorithm by 
iterating back and forth between estimates of G and (S_t, E_t). However, we have 
found that the greedy estimate of S and E, while noisy, is sufficient to determine 
a G consistent with the subject’s limb topology. 

4 Temporally Constrained Branching Spines 

In this section, we will start by describing our method for locating the set of end 
effectors E_t and extracting a branching skeleton graph from a single 3D surface 
X_t. Using this or other techniques, we can generate an individual skeleton S_t 
at each time t, 1 ≤ t ≤ T. These (S_t, E_t) will be inherently noisy, as a result 
of being calculated independently for each t. In Section 4.2, we describe how 
we combine these individual and often overly complex graphs into a consistent, 
representative spine for the entire time sequence. 

The fairly significant attention given to the problem of building a single 
branching 3D skeleton includes numerous approaches. After experimenting with 
portions of several of these [20,15], we have developed our own extension to the 
level-set method of [27]. In theory, any 3D skeleton-finding technique would be 
suitable, if it meets the following requirements: 

1. Is self-initializing by automatically finding extrema E_t. 

2. Generates a principal curve leading to each extremum. 

3. Constructs internal junctions of curves only as necessary to make a connected tree. 

More precision might be achieved with more iterations or other techniques, 
but these might only further improve the results of applying our general proba- 
bilistic framework of (3). We proceed to explain our greedy method for obtaining 
a 3D branching skeleton S_t from a surface, with just one iteration of maximizing 
(3)’s second term followed by correspondence tracking. 



4.1 Creating a Skeleton for a Single Surface 

Once we have a 3D surface X_t for volumetric image (or frame) t, we want to 
extract a skeleton from it. We accomplish this goal in two stages. First we find the 
tip of each extremity and grow a skeleton from it. Then we merge the resulting 
skeletons to maximize the presence of the highest quality portions of each. In 
terms of maximizing P(S_t, E_t | X_t), we are first finding a set of candidates for the 
end effectors of E_t and the limbs of S_t. We then pick from these the combination 
that is optimal with respect to our probability metric. 

Growing Skeletons: This part of our algorithm is based on the work of [27]. 
Starting at a seed point on an extremity of the mesh, they sweep through the sur- 
face vertices, labelling each with its increasing geodesic distance. These distances 
are treated as a gradient vector field, which is in turn examined for topological 
critical points. The critical points are used as surface attachment sites for virtual 
links (non-centered) between the axes when the mesh branches. 

But for our purposes, we want a skeleton that always traverses through the 
middle of the subject’s extremities. Locating meaningful extremal points is itself 
an open problem, though the difficulties are generally application specific. Much 
like the above algorithm which has one source, the vertices of a surface mesh can 
be labelled with their average geodesic distance (AGD) to all other points. Sur- 
face points thus evaluated to be local extrema of the AGD function correspond 
to protrusions. Knowledge of the expected size of “interesting” protrusions can 
be used as a threshold on which local maxima qualify as global extrema. 
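
As an illustration of the AGD labelling described above, the following minimal sketch approximates geodesic distance by shortest paths along mesh edges and reports vertices whose AGD exceeds that of all their neighbours. The adjacency-list layout, the function names, and the optional threshold argument are assumptions made for this example; they are not taken from our implementation, and the exhaustive all-pairs computation is only practical for small meshes.

```python
import heapq

def shortest_paths(adj, src):
    """Dijkstra over an adjacency list {vertex: [(neighbour, edge_length), ...]}."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def average_geodesic_distance(adj):
    """AGD of every vertex: mean shortest-path distance to all other reachable vertices."""
    agd = {}
    for u in adj:
        dist = shortest_paths(adj, u)
        agd[u] = sum(dist.values()) / max(len(dist) - 1, 1)
    return agd

def local_maxima(adj, agd, threshold=0.0):
    """Vertices whose AGD is above the (hypothetical) threshold and above all neighbours."""
    return [u for u in adj
            if agd[u] >= threshold and all(agd[u] > agd[v] for v, _ in adj[u])]

# Toy example: a chain of five vertices with unit-length edges.
chain = {0: [(1, 1.0)], 1: [(0, 1.0), (2, 1.0)], 2: [(1, 1.0), (3, 1.0)],
         3: [(2, 1.0), (4, 1.0)], 4: [(3, 1.0)]}
agd = average_geodesic_distance(chain)
tips = local_maxima(chain, agd)
```

On the chain, the two end vertices are the only local maxima, which matches the intuition that protrusion tips lie farthest, on average, from the rest of the surface.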

Hilaga et al. [15] address the significant computational cost of finding the 
AGD by approximating it with uniformly distributed base seed-points. Applying 
the simpler base-point initialization of [27,10] in a greedy manner located the 
desired candidates for E_t for our data sets. 

Instead of the separate distance and length terms minimized by [7], we use 
the isocontours of geodesic distance to build level sets that serve as our error 
metric. The vertices of the mesh are clustered into those level-sets by quantizing 
their distances from the seed point into a fixed number of discrete bins (usually 
100). Figures 2C-D illustrate this process. Each skeleton node is constructed by 
minimizing the distance between the vertices in the level set and the node, i.e., 
the centroid of the vertices. 
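
The quantization step can be sketched as follows. This is a simplified illustration under stated assumptions: the per-vertex geodesic distances from the seed are taken as given, and each bin yields a single centroid, whereas a level set that spans several branches of the mesh would need to be split into connected rings first. The function name and array layout are hypothetical.

```python
import numpy as np

def level_set_nodes(vertices, geodesic, n_bins=100):
    """Quantize geodesic distances into n_bins level sets and return,
    for each non-empty bin, the centroid of its vertices."""
    vertices = np.asarray(vertices, dtype=float)
    geodesic = np.asarray(geodesic, dtype=float)
    span = geodesic.max() - geodesic.min()
    if span == 0:
        bins = np.zeros(len(geodesic), dtype=int)
    else:
        scaled = (geodesic - geodesic.min()) / span * n_bins
        bins = np.minimum(scaled.astype(int), n_bins - 1)  # farthest vertex joins the last bin
    # The centroid minimizes the summed squared distance to the vertices of its bin.
    nodes = {int(b): vertices[bins == b].mean(axis=0) for b in np.unique(bins)}
    return bins, nodes

# Toy example: four vertices on a line, two level sets.
verts = [[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]]
bins, nodes = level_set_nodes(verts, [0.0, 1.0, 2.0, 3.0], n_bins=2)
```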

By walking along edges of the surface graph from the seed point’s level set 
toward the last one, skeleton-nodes are added and progressively connected to 
each other. Figure 3A illustrates this process in 2D. This approach successfully 
creates a tree graph of nodes, or skeleton, which represents the central axes and 
internal branching points of genus zero meshes. 

The skeleton-generation algorithm is repeated for each of the other limb-tips, 
producing a total of five skeleton-graphs for the starfish example (see Figure 2). 
These are our candidates for the best S_t for this X_t. Note that the most compact 
level-sets usually appear as tidy cylindrical rings on the limb where that 
respective skeleton was seeded. 

Merging Skeletons: All of the constituent skeletons S_t serve as combined 
estimates of the mesh’s underlying limb structure. The best representation of 
that structure comes from unifying the most precise branches of those skeletons 
- the ones with smallest error, or equivalently, maximum P(S_t, E_t | X_t). A high 
quality skeleton node best captures the shape of its “ring” of vertices when the 
Fig. 2. Example of generating a skeleton for a synthetic starfish mesh. (A) Capture 
images of the starfish from a variety of vantage points. (B) Extract a 3D surface using 
generalized voxel carving and improved marching cubes. (C) Starting at one extremity 
tip, calculate geodesic distances for each vertex. (D) Quantize distances and cluster 
vertices into bins of the same distance. (E) Create a skeleton by walking through the 
progression of level set rings. (F) Repeat C-E for each tip and merge into a single 
representative skeleton. 



ring is short and has small major and minor axes. With this metric, we calculate 
a cost function C for each node in the constituent skeletons: 



C_i = \frac{\sigma_1^2 + \sigma_2^2 + \sigma_3^2}{\text{\# of points in ring } i}    (4) 

The σ quantities come from the singular values of the decomposition P = U_P Σ_P V_P^T, 
where P represents the mean-centered coordinates of the points p_i in this ring. 
Note that the resulting vectors in V_P^T = {v_1 | v_2 | v_3}^T will usually represent 
the ring’s major, minor, and central axes. Replacing v_3 with v_1 × v_2 produces 
a convenient local right-hand coordinate frame for each node. 
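
A minimal numerical sketch of this node cost is given below. It assumes that the cost in (4) is the sum of the squared singular values of the mean-centered ring divided by the number of ring points, and it builds the local frame by replacing the third right singular vector with the cross product of the first two. The function name is hypothetical.

```python
import numpy as np

def ring_cost_and_frame(points):
    """Return the node cost of (4) and a right-handed local frame for one ring.

    points: (n, 3) array-like of ring vertices, n >= 2.
    """
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)            # mean-centered coordinates P
    _, sigma, vt = np.linalg.svd(centered, full_matrices=False)
    cost = float(np.sum(sigma ** 2) / len(pts))  # assumed form of equation (4)
    v1, v2 = vt[0], vt[1]                        # major and minor axes of the ring
    v3 = np.cross(v1, v2)                        # enforces a right-handed frame
    return cost, np.stack([v1, v2, v3])

# Toy example: four points on the unit circle in the xy-plane.
ring = [[1, 0, 0], [0, 1, 0], [-1, 0, 0], [0, -1, 0]]
cost, frame = ring_cost_and_frame(ring)
```

For this ring the cost equals 1 and the third frame axis is aligned with the z-axis, i.e. the direction of the limb's central axis.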

Each chain of bi-connected nodes represents a limb. To assemble the single 
representative graph of this frame, we copy the best version of each limb available 
in the constituent skeletons. Limb quality Q_L is measured as: 

Q_L = N - \sum_{i=1}^{N} C_i    (5) 



where N is the total number of nodes in limb L. Since nodes from different skeletons 
are being compared through (5), the C_i must be normalized by dividing 
them all by the max(C_i) of all the skeletons. 
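
The limb comparison can be sketched as follows, assuming the reading of (5) in which the quality of a limb is its node count minus the sum of its normalized node costs. The dictionary layout and function names are hypothetical, and the global maximum cost is passed in explicitly because it is taken over all constituent skeletons rather than over a single limb.

```python
def limb_quality(costs, global_max):
    """Q_L = N - sum of normalized node costs, with N the number of nodes in the limb."""
    normalized = [c / global_max for c in costs]
    return len(normalized) - sum(normalized)

def best_limb(candidates, global_max):
    """candidates maps a skeleton id to the node costs of its version of one limb."""
    return max(candidates, key=lambda k: limb_quality(candidates[k], global_max))

# Toy example: two versions of the same limb from two constituent skeletons.
candidates = {"skeleton_a": [2.0, 4.0], "skeleton_b": [1.0, 1.0]}
winner = best_limb(candidates, global_max=4.0)
```

Because each normalized cost lies in [0, 1], a limb gains quality for every node whose ring is tight, so longer limbs with compact rings are preferred.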

Figure 3B illustrates a novel algorithm that we developed to generate limb- 
correspondences for topologically perturbed tree graphs of the same structure. 
Fig. 3. (A) 2D example of clustering connected vertices into bins of similar geodesic 
distance and walking through the resulting level set rings, with one spine node per 
branch of each level set. (B) In the right figure, the red 
and green skeletons represent the same “creature,” possibly seeded from two different 
places. Wishing to copy nodes from the best limbs each constituent skeleton has to 
offer, we developed a leaf-node seeking topology matching algorithm that recognizes 
that these pairs of three-way junctions should be a single four-way junction. 

There appears to be no previously established graph theoretic solution for this 
problem, and our approach is simply: 

1. Tag all limb-tips that we are confident of as Supernodes; i.e., nodes on both 
color graphs located at [A, B, C, D] correspond to each other. 

2. Traversing inward, the next encountered branch-node in each graph also corresponds 
to that of the other color: walking from supernode A, the skeleton-nodes 
at the square-symbols should be grouped into a supernode of their 
own. From C, the circles will form a supernode. Iterating this process from 
the outside inward will reveal that the circle and square supernodes should 
be merged into a four-way metanode, which would serve as the point of 
unification when merging limbs from the red and green skeletons. 

4.2 Correspondence Tracking 

Now that we can estimate a single skeleton that represents one volumetric image, 
we adapt the process to handle a sequence of volumes. All the measurements from 
the sequence of X_{1:T} are now abstracted as (S_{1:T}, E_{1:T}), simplifying the first 
term in (3) to P(G | S_{1:T}, E_{1:T}). Finding the G that maximizes this probability 
eliminates extraneous limbs which might have resulted from overfitting. The 
danger of overfitting exists because skeleton elements may be created in support 
of surface-mesh elements that looked like protrusions in that frame only. 

Our 3D correspondence problem of finding the best G is significantly easier 
to automate than trying to perform surface-vertex matching between two dense 
meshes of the sequence. Assuming the subject grows no new appendages and 
with no other priors, we can safely choose the appropriate number of tips to be 
the most frequently observed number of limb tips. This number of tips, or leaf 
nodes in G, is K = the mode of |E_t|, 1 ≤ t ≤ T (see Figure 7). 
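
Choosing K amounts to taking the mode of the per-frame tip counts. A minimal sketch, with a hypothetical function name and made-up counts:

```python
from collections import Counter

def canonical_tip_count(tip_counts):
    """Return K, the most frequently observed number of limb tips |E_t|."""
    return Counter(tip_counts).most_common(1)[0][0]

# Toy example: tip counts for seven frames; most frames show five tips.
K = canonical_tip_count([5, 5, 6, 5, 4, 5, 7])
```

When two counts are equally frequent, `most_common` returns whichever was seen first, so a real sequence with a tie would need an explicit rule.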

Knowing how many appendages to look for, we spatially align each ex- 
ploratory skeleton from the sequence with respect to its temporal neighbors 
to reveal the |E_t| − K superfluous tips that should be culled. We start with all 
the subsequences of frames that already have the correct number of tips K, and 
tag the frame from the middle of the largest such cluster as the reference frame; 
allowing that longer sequences may need to automatically select multiple reference 
frames. Each frame is then processed in turn, constructing a combinatorial 
list of possible tip-correspondences between the reference tips A and the tips in 
the current frame B. Each possible mapping of B → A is evaluated using the 
point-cluster alignment algorithm of [1]. Their technique aligns point clouds as 
much as possible using only translation and rotation. The combination with the 
smallest error, E_min, is kept as the correct assignment, where 

E = \sum_{k=1}^{K} \| B_k - R A_k - T \|^2.    (6) 



Here R and T are the least-squares optimal rotation and translation. T simply 
comes from the alignment of the point clouds’ centroids. R is calculated 
by maximizing Trace(RH), where H is the accumulated point correlation 
matrix: 

H = \sum_{k=1}^{K} A_k B_k^T.    (7) 

By decomposing H = U_R Σ_R V_R^T, the optimal rotation is: 

R = V_R U_R^T.    (8) 
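
The alignment and the brute-force search over tip assignments can be sketched as follows. This is an illustrative sketch, not our implementation: the function names are hypothetical, the points are centered on their centroids before accumulating H, and a guard against reflections (determinant of R equal to −1) has been added even though it is not discussed in the text.

```python
import itertools
import numpy as np

def rigid_align(a, b):
    """Least-squares rotation R and translation T mapping points a onto b.

    a, b: (K, 3) arrays of corresponding points. Returns (R, T, error of eq. 6).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    h = (a - ca).T @ (b - cb)          # H = sum_k A_k B_k^T on centered points
    u, _, vt = np.linalg.svd(h)        # H = U Sigma V^T
    r = vt.T @ u.T                     # R = V U^T
    if np.linalg.det(r) < 0:           # added guard: reject reflections
        vt[-1] *= -1
        r = vt.T @ u.T
    t = cb - r @ ca                    # aligns the centroids
    err = float(np.sum((b - (a @ r.T + t)) ** 2))
    return r, t, err

def best_tip_assignment(ref_tips, cur_tips):
    """Try every ordered choice of len(ref_tips) current tips; keep the smallest error."""
    ref = np.asarray(ref_tips, dtype=float)
    cur = np.asarray(cur_tips, dtype=float)
    best = None
    for perm in itertools.permutations(range(len(cur)), len(ref)):
        _, _, err = rigid_align(ref, cur[list(perm)])
        if best is None or err < best[0]:
            best = (err, perm)
    return best

# Toy example: reference tips rotated 90 degrees about z, translated, and shuffled.
ref = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3]], dtype=float)
rot = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
moved = ref @ rot.T + np.array([1.0, 2.0, 3.0])
cur = moved[[2, 0, 3, 1]]
err, perm = best_tip_assignment(ref, cur)
```

Here `perm[k]` is the index of the current-frame tip matched to reference tip k. When the current frame has more tips than the reference, the unselected tips are the superfluous ones; the number of candidate assignments grows factorially, which is why the approach is only suitable for a small number of limbs.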

After assigning the tips of all these frames, we apply the same error metric 
to try out the combinations of tip-assignments with frames having alternate 
numbers of tips. However, these frames are compared to both the reference 
frame and the frame nearest in time with K tips. This brute-force exploration of 
correspondence is computationally tractable and robust for creatures that exhibit 
some asymmetry and have a reasonable number of limbs (typically < 10). 



4.3 Imposing a Single Graph on the Spine 

With the known trajectories of corresponding limb tips throughout the sequence, 
we can re-apply the skeleton merging technique from Section 4.1. This time 
however, we do not keep all the limbs as we did in the exploratory phase, only 
those that correspond to the K limb-tips. The results of this portion of the 
algorithm are pictured in Figure 4 and discussed further in Section 5. 
Fig. 4. Refinement through imposing of correspondence into the sequence. (A) Without 
refinement. (B) With temporal constraint. 



Except for the frames of the sequence where the subject’s limbs were hid- 
den or tucked too close to the body, we can expect the topology of skeletons 
throughout the sequence to be identical. The most frequently occurring topology 
is established as G, and corresponds to the first term in (3). This correspondence 
and trajectory information allows us to construct a single character spine 
for playback of the whole sequence of poses by parameterizing on each limb’s 
length. Each topologically consistent limb of the skeleton sequence is resampled 
at the same interval producing a single spine. 
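
Resampling a limb at a common interval can be illustrated by parameterizing its polyline of skeleton nodes on arc length. The sketch below assumes linear interpolation between consecutive nodes and no repeated nodes; the function name and sample count are hypothetical.

```python
import numpy as np

def resample_limb(nodes, n_samples):
    """Resample a limb (ordered 3D skeleton nodes) at n_samples points
    equally spaced along its arc length."""
    pts = np.asarray(nodes, dtype=float)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)   # segment lengths
    s = np.concatenate([[0.0], np.cumsum(seg)])          # arc length at each node
    targets = np.linspace(0.0, s[-1], n_samples)
    return np.stack([np.interp(targets, s, pts[:, d]) for d in range(pts.shape[1])],
                    axis=1)

# Toy example: an unevenly sampled straight limb of length 4.
limb = [[0, 0, 0], [1, 0, 0], [4, 0, 0]]
samples = resample_limb(limb, 5)
```

Applying the same sample count to the corresponding limb in every frame gives node-to-node correspondence along the limb throughout the sequence.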

5 Experiments and Results 

We tried our algorithm on a variety of small creatures after building a data- 
capture stage that would both be comfortable for our subjects and minimize 
the need for video segmentation beyond chromakeying. Twenty video cameras 
were attached to an aluminum exoskeleton shaped roughly like a cylinder 3 
meters in diameter. Their viewing angles were chosen heuristically to maximize 
viewing coverage and to minimize instances of cameras seeing each other’s lenses. 
The capture volume itself is (75 cm)^3, and can accommodate creatures that stay 
within the space (Figure 5). Our subjects often required human proximity and 
were too heavy for our transparent flooring, so we were only able to leverage a 
subset of the cameras present. 

With this setup, we are able to obtain video from a dome of inward facing, 
calibrated and synchronized cameras [29,6]. This allowed us to employ the Gen- 
eralized Voxel Carving (GVC) algorithm of [11]. Their system functions as a hy- 
brid form of wide-baseline stereo and voxel-carving, enabling the resulting voxel 
model to reflect concavities found on parts of the subject’s surface. Each second 
of multi-view footage produces 30 voxel models similar to the system of [26]. 

5.1 Real Subjects 

Baby: The baby data is the result of filming an 11-month old infant using nine 
cameras. The sequence is 45 frames long because that was the speed with which 
Fig. 5. Our Capture Setup: Twenty video cameras were attached to an aluminum 
exoskeleton shaped roughly like a cylinder 3 meters in diameter. Their viewing angles 
were chosen heuristically to maximize viewing coverage of subjects raised in the middle, 
and to minimize instances of cameras seeing each other’s lenses. The capture volume 
itself is (75 cm)^3. 




Fig. 6. (A) Baby dataset: From left to right, one of the views, voxels, polygonal 
model, level sets, and skeleton with distance function. (B) Dog dataset: subject, 
polygonal model, distance function, level sets, and resulting spine. (C) Camel Puppet 
dataset: one view, wireframe, distance function, level sets, and resulting spine. 



she crawled down the length of the stage. Her progress forward is mostly due to 
her arms and right leg, while she tends to drag her left leg which causes frequent 
merging of her voxel-model from the waist down. The spine generation models 
her head and arms very consistently, but the correspondence tracker cannot 
resolve her legs and mis-assigns one leg or the other for the majority of frames. 

Dog: The dog was the most challenging of our test-subjects simply because 
we had only seven cameras that could operate without also filming the dog’s 
handlers. The volume reconstructions are all close to their average of 1.04M 
voxels. Examination of the polygonal-mesh sequence reveals that much of this 
bulk comes from the ghost-voxels under his stomach that were carved successfully 
in the previous and subsequent test subjects when more cameras were running. 

Camel Puppet: The camel marionette, pictured in Figure 6C, is 26 cm long 
and stretches to a height of 42 cm. While the subject didn’t change in volume 
throughout shooting, its representation varied throughout the sequence between 
600k and 800k voxels, largely due to self-occlusions. The polygonal representa- 
tions averaged 200k polygons. The sequence has 495 frames, and was filmed using 
12 color cameras. The camel’s motion changes in the sequence from leg-jostling 
at the start to vigorous kicking and raising of the neck by the end. Our system 
was only hindered by the occasional “merging” of legs as they tucked underneath 
or appeared close enough to each other to be joined in the voxel stage. With 
mostly good frames, the exploratory skeleton-generation fed the correspondence 
tracker, which in turn determined that there were five limbs. The resulting crea- 
ture spine is pictured in Figure 4B. As illustrated, the correspondence tracking 
balances out the greedy limb inclusion of the exploratory skeletons. The online 
video also demonstrates this. 

The average processing times for skeleton-generation using our unoptimized 
implementation of the algorithms were consistently under four minutes per mesh 
on a Pentium 4 PC with one or more GB of memory. The correspondence- 
tracking portion of our algorithm (Section 4.2) took ten minutes on our 495 
frame camel sequence, and less than three minutes on all our other sequences. 
The preprocessing stage leading to input meshes is an implementation of GVC 
that adds approximately 12 minutes to each frame, with 3-8 seconds for Marching 
Cubes. GVC is not part of our contribution, and can be exchanged for other dense 
stereo or silhouette-carving algorithms, some of which may, though we have not 
yet tested this, have superior run-time performance without impacting quality. 
We have data of other example subjects that will be posted on our website, and 
the volumetric data has already been shared with other researchers. 
Fig. 7. Number of skeleton tips found per frame during greedy search. Horizontal axis: 
frame number in the sequence of volume images of the camel. 
6 Conclusion and Future Work 

We have proposed spines as a novel 3D spatio-temporal representation for se- 
quences of volume images. This shape and motion descriptor introduces a method 
for imposing temporal correspondence on limb topologies when dealing with ar- 
ticulated subjects. We also present an algorithm for efficiently extracting branch- 
ing spines from surface data. Finally, we have presented example data where the 
temporally integrated canonical graph improves the quality of individual skele- 
tons. 

Where the current fully bottom-up work leaves off, extensions are planned 
that will allow a prior skeleton estimate to be forced on the data. This will 
especially apply to meshes where the limbs tuck in or become genus 1+. While 
the current results reflect that fairly noisy data, without priors, still reveals the 
real end effectors and underlying structure, further work is needed to track pose 
even in very poor data. 
Abstract. Our purpose is to provide an augmented reality system for 
Radio-Frequency guidance that could superimpose a 3D model of the 
liver, its vessels and tumors (reconstructed from CT images) on external 
video images of the patient. In this paper, we point out that clinical usability 
not only needs the best affordable registration accuracy, but also 
a certification that the required accuracy is met, since clinical conditions 
change from one intervention to the other. Beginning by addressing 
accuracy performance, we show that a 3D/2D registration based 
on radio-opaque fiducials is better suited to our application constraints 
than other methods. Then, we point out a shortcoming in its statistical assumptions, 
which leads us to the derivation of a new extended 3D/2D criterion. 
Careful validation experiments on real data show that an accuracy of 2 
mm can be achieved in clinically relevant conditions, and that our new 
criterion is up to 9% more accurate, while keeping a computation time 
compatible with real-time at 20 to 40 Hz. 

After the fulfillment of our statistical hypotheses, we turn to safety issues. 
Propagating the data noise through both our criterion and the classical 
one, we obtain an explicit formulation of the registration error. As the 
real conditions do not always fit the theory, it is critical to validate 
our prediction with real data. Thus, we perform a rigorous incremental 
validation of each assumption using successively: synthetic data, real 
video images of a precisely known object, and finally real CT and video 
images of a soft phantom. Results point out that our error prediction is 
fully valid in our application range. Eventually, we provide an accurate 
Augmented Reality guidance system that allows the automatic detection 
of potentially inaccurate guidance. 



1 Introduction 

The treatment of liver tumors by Radio-Frequency (RF) ablation is a new technique 
that is beginning to be widely used [11]. However, the guidance procedure to 
reach the tumors with the electrode is still made visually with per-operative 2D 
cross-sections of the patient using either Ultra-Sound (US) or Computed Tomog- 
raphy (CT) images. Our purpose is to build an augmented reality system that 
could superimpose reconstructions of the 3D liver and tumors onto video images 
in order to improve the surgeon’s accuracy during the guidance step. According 
to surgeons, the overall accuracy of such a system has to be better than 5 mm to 
provide significant help. 

In our setup, a CT-scan of the patient is acquired just before the intervention 
(RF is a radiological act), and an automatic 3D reconstruction of his skin, 
his liver and the tumors is performed [2]. Two cameras (jointly calibrated) are 
viewing the patient’s skin from two different points of view. The patient is intubated 
during the intervention, so the volume of gas in his lungs can be controlled 
and monitored. It is then possible to hold the volume at the same value for 
a few seconds repeatedly, and to perform the electrode’s manipulation under almost 
the same volume conditions as those obtained during the preliminary 
CT-scan. Balter [1] and Wong [20] indicate that the mean tumor repositioning 
at exhalation phase in a respiratory-gated radiotherapy context is under 1 mm. 
Thus, it is reasonable to assume that a rigid registration is sufficient to register 
accurately the 3D-model extracted from the CT with the 2D video images. 

Critical issues for computer-guided therapy systems are accuracy and
reliability. Indeed, the surgeon has no other source of information than the
augmented reality system during the guidance step: he has to rely fully on
it. As many parameters can change from one intervention to the next (angle
between the cameras, focal length of the cameras, curvature of the patient's
abdomen), the accuracy provided can vary sharply. For instance, in a
point-based registration context, the accuracy can vary by a factor of two
when the angle between the cameras goes from 20° to 60° [12]. Consequently,
we cannot afford to provide a system without assessing its accuracy for
every possible intervention: we need to tackle both the system accuracy and
the capability to assess its value before the intervention. Moreover, every
gain in accuracy may be exploited to relax some constraints in the system
setup (position of the cameras, ergonomics, computation time, etc.).

To answer these requirements, we review in Section 2 the existing
registration techniques, focusing more particularly on 3D/2D point-based
methods. As their statistical assumptions are not fully satisfied in our
application (our 3D point measurements cannot be considered noise-free), we
derive a new criterion that extends the classical one. Experimental results
on synthetic and phantom data show that it provides a registration up to 20%
more accurate. To be able to quantify this accuracy online, we apply in
Section 3 the general theory of error propagation to our new criterion and
its standard version. This gives us an analytical formulation of the
covariance matrix of the sought transformation. But this is only the first
part of the job: we then need to validate this prediction w.r.t. the
statistical assumptions used to derive the theoretical formula (small
non-linearity of the criterion, perfect calibration, unbiased Gaussian noise
on points, etc.). Incremental tests with synthetic data, real cameras, and
finally real data of a soft phantom show that our prediction is reliable for
our current setup, but may require the inclusion of calibration and skin
motion errors if the system were to become more accurate.

2 A New 3D/2D Point-Based Registration Criterion

This section aims at finding the most accurate registration method for our ap- 
plication. 



2.1 Surface-Based, Iconic, 3D/3D, or 3D/2D Registration?

Surface and iconic registration using mutual information have been used to reg- 
ister the 3D surface of the face to either video images [19] or another 3D surface 
acquired with a laser range scanner [5]. In both cases, thanks to several highly 
curved parts on the model (nose, ears, eyes), the reported accuracy was under 5 
mm. We believe that in our case, the “cylindrical” shape of the human abdomen 
is likely to lead to much larger uncertainties along the cranio-caudal axis. 

Landmark-based 3D/3D or 3D/2D registration can be performed when several
precisely located points are visible both in the 3D model and in the video
images. Since the landmarks are truly homologous, the geometry of the
underlying abdomen surface is no longer a problem. As there are no visible
anatomical landmarks in our case, we chose to stick radio-opaque markers onto
the patient's skin; these are currently localized interactively (an automatic
segmentation is currently being tested). The matching is performed using
epipolar geometry between video points, and a prediction/verification
(alignment) algorithm between video and CT points.

As our system is based on a stereoscopic video acquisition, one could think of 
using a stereoscopic reconstruction. In our case, the main problem is the possible 
occlusion of some 2D points in one of the cameras, which would lead to discarding 
the information provided by this point in the other camera. Moreover, one would 
need to compute non-isotropic uncertainty of the reconstructed 3D points [8] to 
optimize a 3D/3D registration criterion fitting well the statistical assumptions. 
Thus, we believe that it is better to rely on least-squares (LSQ) 3D/2D
registration criteria.

The 3D/2D registration problem has been widely studied in a variety of
settings. Briefly, the different methods can be classified into three groups:
closed-form, linear and non-linear. The first two classes were proposed in
the last decades to find the registration as quickly as possible in order to
fulfill real-time constraints [6,3,7]. However, they are very sensitive to
noise because they assume that data points are exact, contrary to non-linear
methods. Consequently, non-linear methods provide better accuracy [10,18]. As
accuracy is crucial in our application, we think that optimizing an LSQ
criterion has a definite advantage over the other methods because it can take
into account all the information provided by the data. However, all of the
existing methods [4,9,15,10] implicitly consider that the 2D points are
noisy, but that the 3D points of the model to register are exact. In our
case, this assumption is definitely questionable, which led to the
development of a new maximum likelihood (ML) criterion generalizing the
standard 3D/2D LSQ criterion.

2.2 Maximum Likelihood 3D/2D Registration

Notations. Let $M_i$ ($i \in \{1, \dots, N\}$) be the 3D points that
represent the exact localization of the radio-opaque fiducials in the CT-scan
reference frame and $m_i^{(l)}$ be the 2D points that represent their exact
positions in the images of camera $(l)$. To account for occlusion, we use a
binary variable $\xi_i^l$ equal to 1 if $M_i$ is observed in camera $(l)$ and
0 otherwise. We denote by $\langle \cdot | \cdot \rangle$ the cross-products,
by $T \star M$ the action of the rigid transformation $T$ on the 3D point $M$
and by $P^{(l)}$ ($1 \le l \le S$) the camera's projective functions from 3D
to 2D such that $m_i^{(l)} = P^{(l)}(T \star M_i)$. In the following
sections, $\hat{A}$ will represent an estimation of a perfect data $A$, and
$\tilde{A}$ will represent an observed measure of a perfect data $A$.



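Standard Projective Points Correspondences (SPPC) Criterion. Assuming that
the 3D points are exact ($\tilde{M}_i = M_i$) and that only the 2D points are
corrupted by an isotropic Gaussian noise $\eta_i$ of variance
$\sigma_{2D}^2$, the probability of measuring the projection of the 3D point
$M_i$ at the location $\tilde{m}_i^{(l)}$ in image $(l)$, knowing the
transformation parameters $\theta = \{T\}$, is given by:

$$p(\tilde{m}_i^{(l)} \mid \theta) = \frac{1}{2\pi\sigma_{2D}^2} \cdot \exp\left( - \frac{\| P^{(l)}(T \star M_i) - \tilde{m}_i^{(l)} \|^2}{2 \cdot \sigma_{2D}^2} \right)$$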



Let $\chi$ be the data vector regrouping all the measurements, in this case
the 2D points $\tilde{m}_i^{(l)}$ only. Since the detection of each point is
performed independently, the probability of the observed data is
$p(\chi \mid \theta) = \prod_{l=1}^{S} \prod_{i=1}^{N} p(\tilde{m}_i^{(l)} \mid \theta)^{\xi_i^l}$.
In this formula, unobserved 2D points (for which $\xi_i^l = 0$) are
implicitly taken out of the probability. Now, the Maximum Likelihood
transformation $\hat{\theta}$ maximizes the probability of the observed data,
or equivalently, minimizes its negative log:

$$-\log p(\chi \mid \theta) = \sum_{l=1}^{S} \sum_{i=1}^{N} \xi_i^l \cdot \frac{\| P^{(l)}(T \star M_i) - \tilde{m}_i^{(l)} \|^2}{2 \cdot \sigma_{2D}^2} + \Big( \sum_{l=1}^{S} \sum_{i=1}^{N} \xi_i^l \Big) \cdot \log(2\pi\sigma_{2D}^2)$$

Thus, up to a constant factor, this ML estimation boils down to the classical
3D/2D points LSQ criterion.
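
As an illustration, the SPPC criterion can be evaluated in a few lines. The
following Python sketch is not taken from the paper: the function names
`project` and `sppc_criterion`, the representation of each $P^{(l)}$ as a 3x4
pinhole matrix, and the description of $T$ by a rotation matrix and a
translation vector are assumptions made for the example. The constant
logarithmic term is dropped since it does not depend on $T$.

```python
import numpy as np

def project(P, X):
    """Pinhole projection of (N, 3) points X with a 3x4 matrix P (assumed camera model)."""
    h = np.hstack([X, np.ones((len(X), 1))]) @ P.T
    return h[:, :2] / h[:, 2:3]

def sppc_criterion(R, t, M, m_obs, vis, cams, sigma_2d):
    """Sum over cameras and visible points of squared reprojection errors / (2 sigma_2d^2)."""
    X = M @ R.T + t                          # T * M_i for every fiducial
    total = 0.0
    for P, m, v in zip(cams, m_obs, vis):    # one entry per camera
        d2 = np.sum((m - project(P, X)) ** 2, axis=1)
        total += np.sum(v * d2) / (2.0 * sigma_2d ** 2)
    return total

# Toy data (hypothetical): one camera with P = [I | 0], three points in front of it.
P = np.hstack([np.eye(3), np.zeros((3, 1))])
M = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 4.0], [0.0, 1.0, 5.0]])
R, t = np.eye(3), np.zeros(3)
m_exact = project(P, M)
vis = np.array([1.0, 1.0, 0.0])              # third point is occluded
m_shifted = m_exact + np.array([1.0, 0.0])   # one-pixel error on every observation

print(sppc_criterion(R, t, M, [m_exact], [vis], [P], 1.0))
print(sppc_criterion(R, t, M, [m_shifted], [vis], [P], 1.0))
```

With exact observations the criterion vanishes; with a one-pixel shift and
$\sigma_{2D} = 1$, each of the two visible points contributes 1/2, and the
occluded point contributes nothing.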



Extended Projective Points Correspondences (EPPC) Criterion. To introduce a
more realistic statistical hypothesis on the 3D data, it is thus safer to
consider that we are measuring a noisy version of the exact points:
$\tilde{M}_i = M_i + \varepsilon_i$ with
$\varepsilon_i \sim N(0, \sigma_{3D})$.

In this case, the exact location $M_i$ of the 3D points is considered as a
parameter, just as the transformation $T$. In statistics, this is called a
latent or hidden variable, while it is better known as an auxiliary variable
in computer vision. Thus, knowing the parameters
$\theta = \{T, M_1, \dots, M_N\}$, the probability of measuring respectively
a 2D and a 3D point is:

$$p(\tilde{m}_i^{(l)} \mid \theta) = G_{\sigma_{2D}}\big(P^{(l)}(T \star M_i) - \tilde{m}_i^{(l)}\big) \quad \text{and} \quad p(\tilde{M}_i \mid \theta) = G_{\sigma_{3D}}\big(\tilde{M}_i - M_i\big).$$

One important feature of this statistical modeling is that we can safely
assume that all 3D and 2D measurements are independent. Thus, we can write
the probability of our observation vector
$\chi = (\tilde{m}_1^{(1)}, \dots, \tilde{m}_N^{(1)}, \dots, \tilde{m}_1^{(S)}, \dots, \tilde{m}_N^{(S)}, \tilde{M}_1, \dots, \tilde{M}_N)$
as the product of the above individual probabilities. The ML estimation of
the parameters is still given by the minimization of
$-\log(p(\chi \mid \theta))$:

$$C(T, M_1, \dots, M_N) = \sum_{i=1}^{N} \frac{\| \tilde{M}_i - M_i \|^2}{2 \cdot \sigma_{3D}^2} + \sum_{l=1}^{S} \sum_{i=1}^{N} \xi_i^l \cdot \frac{\| \tilde{m}_i^{(l)} - P^{(l)}(T \star M_i) \|^2}{2 \cdot \sigma_{2D}^2} + K$$

where $K$ is a normalization constant depending on $\sigma_{2D}$ and
$\sigma_{3D}$. The convergence is ensured since we minimize the same positive
criterion at each step.

The obvious difference between this criterion and the simple 2D ML is that
we now have to solve for the hidden variables (the exact locations $M_i$) in
addition to the previous rigid transformation parameters. An obvious way to
modify the optimization algorithm is to perform an alternated minimization
w.r.t. the two groups of variables, starting from a transformation
initialization $T_0$ and an initialization of the $M_i$ with the
$\tilde{M}_i$. The algorithm is stopped when the distance between the last
two estimates of the transformation becomes negligible.
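
The alternated minimization can be sketched as follows. This is only a
minimal illustration under stated assumptions, not the authors'
implementation: the paper does not specify the inner optimizer, so a
Gauss-Newton scheme with a numerical Jacobian and step halving is used here;
$T$ is parametrized by a rotation vector and a translation; the cameras are
3x4 pinhole matrices; all names (`rot`, `project`, `residuals`, `minimise`,
`eppc`) and the synthetic data are hypothetical.

```python
import numpy as np

def rot(r):
    """Rotation matrix from a rotation vector (Rodrigues formula)."""
    angle = np.linalg.norm(r)
    K = np.array([[0.0, -r[2], r[1]], [r[2], 0.0, -r[0]], [-r[1], r[0], 0.0]])
    if angle < 1e-12:
        return np.eye(3) + K
    return np.eye(3) + np.sin(angle) / angle * K + (1 - np.cos(angle)) / angle**2 * (K @ K)

def project(P, X):
    """Pinhole projection of (N, 3) points X with a 3x4 matrix P."""
    h = np.hstack([X, np.ones((len(X), 1))]) @ P.T
    return h[:, :2] / h[:, 2:3]

def residuals(T, M, data):
    """Noise-normalised residuals; half their squared norm is the EPPC criterion (without K)."""
    M_obs, m_obs, vis, cams, s2d, s3d = data
    X = M @ rot(T[:3]).T + T[3:]
    res = [((M_obs - M) / s3d).ravel()]
    for P, m, v in zip(cams, m_obs, vis):
        res.append(((m - project(P, X)) * v[:, None] / s2d).ravel())
    return np.concatenate(res)

def criterion(T, M, data):
    r = residuals(T, M, data)
    return 0.5 * float(r @ r)

def minimise(f, x, iters=20, eps=1e-6):
    """Gauss-Newton with step halving: a step is accepted only if the cost does not increase."""
    cost = lambda y: float(f(y) @ f(y))
    for _ in range(iters):
        r = f(x)
        J = np.stack([(f(x + eps * e) - r) / eps for e in np.eye(len(x))], axis=1)
        step = np.linalg.lstsq(J, -r, rcond=None)[0]
        while step @ step > 1e-16 and not (cost(x + step) <= r @ r):
            step = 0.5 * step
        if not (cost(x + step) <= r @ r):
            break
        x = x + step
    return x

def eppc(T0, data, n_alt=10, tol=1e-6):
    """Alternate between the transformation T and the hidden 3D points M."""
    T, M = np.asarray(T0, dtype=float), data[0].copy()   # M_i initialised with the observed points
    for _ in range(n_alt):
        T_prev = T
        T = minimise(lambda t: residuals(t, M, data), T)
        M = minimise(lambda x: residuals(T, x.reshape(-1, 3), data), M.ravel()).reshape(-1, 3)
        if np.linalg.norm(T - T_prev) < tol:             # last two estimates of T are close
            break
    return T, M

# Hypothetical synthetic setup: two cameras 40 degrees apart, 10 fiducials.
rng = np.random.default_rng(0)
Kc = np.array([[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]])
a = np.deg2rad(40.0)
R2 = np.array([[np.cos(a), 0.0, np.sin(a)], [0.0, 1.0, 0.0], [-np.sin(a), 0.0, np.cos(a)]])
cams = [Kc @ np.hstack([np.eye(3), [[0.0], [0.0], [1000.0]]]),
        Kc @ np.hstack([R2, [[0.0], [0.0], [1000.0]]])]
s2d, s3d = 2.0, 1.0
T_true = np.array([0.1, -0.2, 0.05, 10.0, -5.0, 20.0])
M_true = rng.uniform(-150.0, 150.0, (10, 3))
X_true = M_true @ rot(T_true[:3]).T + T_true[3:]
m_obs = [project(P, X_true) + rng.normal(0.0, s2d, (10, 2)) for P in cams]
M_obs = M_true + rng.normal(0.0, s3d, (10, 3))
vis = [np.ones(10), np.ones(10)]
vis[1][0] = 0.0                                          # one marker occluded in camera 2
data = (M_obs, m_obs, vis, cams, s2d, s3d)

T0 = np.zeros(6)
c0 = criterion(T0, M_obs, data)
T_hat, M_hat = eppc(T0, data)
c1 = criterion(T_hat, M_hat, data)
print(c0, c1)
```

Because each inner step is accepted only when the criterion does not
increase, the value of the criterion is non-increasing over the whole
alternation, which mirrors the convergence argument given in the text.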



Discussion. We highlight in [12] that the EPPC can be viewed as a
generalization of either the standard criterion (when $\sigma_{3D} \to 0$),
or a stereoscopic point reconstruction followed by a 3D/3D registration (if
$\sigma_{3D}$ is largely overestimated). A quantitative study on synthetic
data showed that the accuracy gain brought by EPPC depends essentially on the
angle between the cameras and on the ratio of the 2D and 3D SNR [12]. For
instance, with data simulating our clinical conditions, EPPC brings an
accuracy gain of up to 10% if the angle between the cameras is 50° and 18% if
the angle is 20°. Finally, as simulation does not take into account
calibration errors and possible noise modeling errors, we made a careful
validation on real data from a phantom. This showed that a mean accuracy of
2 mm can be reached with a maximum error of 4 mm (obtained when the parameter
configuration is not optimal: small angle between the cameras and/or marker
occlusion) and that we can rely on an accuracy gain of 9% with computation
times that can still fulfill real-time constraints.



3 Theoretical Uncertainty and Prediction Validation 

Now that we have provided a criterion that perfectly fulfills the
statistical conditions of our application, we still face the problem of the
varying accuracy w.r.t. the various system parameters. In order to propose a
safe product to radiologists, we should provide a statistical study that
would give the mean Target Registration Error (TRE) w.r.t. the number of
markers, the angle between the cameras, the focus, and the relative position
of the target w.r.t. the markers. This is the equivalent of the directions
for use and the list of side effects that are mandatory for all drugs in the
therapeutic field, and of the reliability and accuracy tables of robotics
tools: these tables give a usability range to assess under which conditions a
particular feature (for example accuracy) can be reached.







S. Nicolau et al. 



As increasing the number of experiments is very expensive and time- 
consuming, it is almost infeasible to measure the accuracy provided for each 
experimental condition. Moreover, as we want a real-time system, the condi- 
tions may change during the operation (e.g. markers can be occluded by the 
radiologist), and the accuracy assessment has to be constantly updated to avoid 
a potentially dangerous gesture. Consequently, we think that predicting the TRE 
by studying the theoretical noise propagation is the best way to ensure the safety 
of our system. 

3.1 Uncertainty Propagation through SPPC and EPPC Criteria 

In the sequel, we first recall the general theory of covariance propagation
through a criterion. Then, following the methodological framework introduced
in [14,13], we present analytical formulations of the transformation
covariance matrix for SPPC and EPPC.



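General Theory of Error Propagation. Let the criterion $C(\chi, \theta)$ be
a smooth function of the data vector $\chi$ and the parameters $\theta$. We
are looking for the optimal parameter vector
$\hat{\theta} = \arg\min_\theta (C(\chi, \theta))$. A local minimum is
reached and well defined if
$\Phi(\chi, \theta) = \left(\frac{\partial C}{\partial \theta}(\chi, \theta)\right)^T = 0$
and $H = \frac{\partial^2 C}{\partial \theta^2}(\chi, \theta)$ is positive
definite. The function $\Phi$ defines $\hat{\theta}$ as an implicit function
of $\chi$. A Taylor expansion gives:

$$\Phi(\chi + \delta\chi, \theta + \delta\theta) = \Phi(\chi, \theta) + \frac{\partial \Phi}{\partial \chi} \cdot \delta\chi + \frac{\partial \Phi}{\partial \theta} \cdot \delta\theta + O(\delta\chi^2, \delta\theta^2)$$

which means that around an optimum $\hat{\theta}$ we have:

$$\hat{\theta}(\chi + \delta\chi) = \hat{\theta}(\chi) - \left(\frac{\partial \Phi}{\partial \theta}\right)^{-1} \cdot \frac{\partial \Phi}{\partial \chi} \cdot \delta\chi + O(\delta\chi^2)$$

Thus, if $\chi$ is a random vector of mean $\bar{\chi}$ and covariance
$\Sigma_{\chi\chi}$, the optimal vector $\hat{\theta}$ is (up to the second
order) a random vector with mean
$\bar{\theta} = \arg\min_\theta (C(\bar{\chi}, \theta))$ and covariance
$\Sigma_{\theta\theta} = H^{-1} \cdot J_{\Phi\chi} \cdot \Sigma_{\chi\chi} \cdot J_{\Phi\chi}^T \cdot H^{-1}$.
Thus, to propagate the covariance matrix from the data to the parameters
optimizing the criterion, we need to compute
$H = \frac{\partial^2 C(\chi, \theta)}{\partial \theta^2}$,
$J_{\Phi\chi} = \frac{\partial^2 C(\chi, \theta)}{\partial \chi \, \partial \theta}$
and $\Gamma = J_{\Phi\chi} \cdot \Sigma_{\chi\chi} \cdot J_{\Phi\chi}^T$.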
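
To make the propagation formula concrete, here is a small Python sketch on a
case where the result is known in closed form. The linear least-squares
criterion $C(\chi, \theta) = \frac{1}{2}\|A\theta - \chi\|^2$, the matrix `A`
and the function name `propagate_covariance` are assumptions chosen for the
example, not part of the paper. For this criterion $H = A^T A$ and
$J_{\Phi\chi} = -A^T$, so isotropic data noise of variance $\sigma^2$ should
give $\Sigma_{\theta\theta} = \sigma^2 (A^T A)^{-1}$.

```python
import numpy as np

def propagate_covariance(H, J, Sigma_x):
    """Sigma_theta = H^-1 . (J . Sigma_x . J^T) . H^-1 for a symmetric Hessian H."""
    H_inv = np.linalg.inv(H)
    Gamma = J @ Sigma_x @ J.T
    return H_inv @ Gamma @ H_inv.T

# Hypothetical linear problem: 20 measurements, 3 parameters.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
sigma = 0.5

H = A.T @ A            # second derivative of C w.r.t. theta
J = -A.T               # mixed derivative of C w.r.t. theta and the data
Sigma_theta = propagate_covariance(H, J, sigma**2 * np.eye(20))

expected = sigma**2 * np.linalg.inv(A.T @ A)
print(np.allclose(Sigma_theta, expected))
```

For the non-linear SPPC and EPPC criteria the same function applies once $H$
and $J_{\Phi\chi}$ have been evaluated at the optimum, either analytically as
in the text or numerically.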



SPPC Transformation Covariance. Our analytical analysis needs the
block-decomposition of the 3x4 projection matrix $P^{(l)}$ as shown below:

[...]

so that $m_i^{(l)} = P^{(l)}(T \star M_i)$ [...]

The second order derivatives $H$ and $J_{\Phi\chi}$ are computed using the
chain rule, and after some calculations, the uncertainty of the
transformation may be summarized as
$\Sigma_{TT} = H^{-1} \cdot \Gamma \cdot H^{-1}$ with

$$\Gamma = \sum_{i=1}^{N} D_i^T \cdot (\sigma_{3D}^2 \cdot K_i \cdot K_i + L_i) \cdot D_i \quad \text{and} \quad H = \sum_{i=1}^{N} D_i^T \cdot K_i \cdot D_i$$

where $D_i$ = [...]

EPPC Transformation Covariance. For this case the calculations are not usual
because the vector of sought parameters is $\theta = (T, M_1, \dots, M_N)$ so
that:

[...]

Since we only focus on the covariance of the transformation $T$ alone, we
need to extract $\Sigma_{TT}$ from

$$\Sigma_{\theta\theta} = \begin{bmatrix} \Sigma_{TT} & \Sigma_{TM} \\ \Sigma_{MT} & \Sigma_{MM} \end{bmatrix}.$$

This is done using a block matrix inversion, and after long calculations, we
end up with $\Sigma_{TT} = H^{-1} \cdot \Omega \cdot H^{-1}$ where:

$$\Omega = \sum_{i=1}^{N} D_i^T \cdot (\sigma_{3D}^2 \mathrm{Id} + K_i^{-1})^{-1} \cdot (\sigma_{3D}^2 \mathrm{Id} + K_i^{-1} \cdot L_i \cdot K_i^{-1}) \cdot (\sigma_{3D}^2 \mathrm{Id} + K_i^{-1})^{-1} \cdot D_i$$

$$H = \sum_{i=1}^{N} D_i^T \cdot (\sigma_{3D}^2 \mathrm{Id} + K_i^{-1})^{-1} \cdot D_i$$

One can check that for the limit case where $\sigma_{3D} = 0$, the
transformation uncertainty given by $\Sigma_{TT}$ is equal for both criteria.



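Target Registration Error (TRE). Finally, to obtain the final covariance
matrix on a target point $C_i$ after registration, we simply have to
propagate the uncertainty through the transformation action:
$\Sigma_{T \star C_i} = \frac{\partial (T \star C_i)}{\partial T} \cdot \Sigma_{TT} \cdot \frac{\partial (T \star C_i)}{\partial T}^T$.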
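
A short Python sketch of this last propagation step is given below. It
assumes, for illustration only, that $T$ is parametrized by a rotation vector
followed by a translation (the paper does not state its parametrization
here), and it evaluates the Jacobian by central finite differences instead of
analytically. The function names and the numerical values are hypothetical.

```python
import numpy as np

def rot(r):
    """Rotation matrix from a rotation vector (Rodrigues formula)."""
    angle = np.linalg.norm(r)
    K = np.array([[0.0, -r[2], r[1]], [r[2], 0.0, -r[0]], [-r[1], r[0], 0.0]])
    if angle < 1e-12:
        return np.eye(3) + K
    return np.eye(3) + np.sin(angle) / angle * K + (1 - np.cos(angle)) / angle**2 * (K @ K)

def apply(T, C):
    """Action T * C of the rigid transformation T = (rotation vector, translation)."""
    return rot(T[:3]) @ C + T[3:]

def target_covariance(T, C, Sigma_TT, eps=1e-6):
    """Propagate the 6x6 covariance of T to the 3x3 covariance of T * C."""
    J = np.stack([(apply(T + eps * e, C) - apply(T - eps * e, C)) / (2 * eps)
                  for e in np.eye(6)], axis=1)          # 3x6 Jacobian d(T*C)/dT
    return J @ Sigma_TT @ J.T

# Hypothetical values: identity transformation, target at about 11 cm from the origin.
T = np.zeros(6)
C = np.array([100.0, 50.0, 20.0])                       # mm
Sigma_TT = np.diag([1e-4] * 3 + [0.25] * 3)             # rad^2 and mm^2
Sigma_C = target_covariance(T, C, Sigma_TT)
tre_rms = np.sqrt(np.trace(Sigma_C))                    # RMS predicted TRE in mm
print(Sigma_C)
print(tre_rms)
```

The example shows why the TRE depends on the position of the target: the
rotational part of the uncertainty is amplified by the distance between the
target and the origin of the rotation, whereas the translational part is
not.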



3.2 Validation of the Prediction 

With the previous formulas, we are able to predict the accuracy of the transfor- 
mation after the convergence of the algorithm of Section 2. But this is only one 
part of the job: we now have to validate the statistical assumptions used to derive 
the theoretical formula (small non-linearity of the criterion, perfect calibration, 
unbiased Gaussian noise on points, etc.). The goal of this section is to verify 
incrementally that these assumptions hold within our application domain. This 
will be done using synthetic data (for the non-linearities of the criterion), real 
video images of a precisely defined 3D object (for camera calibration and distor- 
tions), and finally real CT and video images of a soft phantom of the abdomen 
(for noise assumptions on point measurements). 



Synthetic Data. Experiments are carried out with two jointly calibrated
synthetic cameras, with an angle between them drawn uniformly from 5° to
120°, focusing on 7 to 25 points $M_i$ randomly distributed in a volume of
about 30 x 30 x 30 cm$^3$. The cameras are located at a distance from the
object of 20 to 50 times the focal length. We add to the 2D and 3D points a
Gaussian noise with $\sigma$ varying from 0.5 to 4.0 (which corresponds to a
SNR of 60 dB to 90 dB$^1$). The registration error is evaluated using control
points $C_i$ to assess a TRE instead of a Fiducial Localization Error (FLE).

Table 1. Validation of the uncertainty prediction with 20000 registrations.

          Mean $\mu^2$ (3.0)    Var $\mu^2$ (6.0)    KS-test (p > 0.01)
  SPPC    3.020                 6.28                 0.353
  EPPC    3.016                 6.18                 0.647


Since each experiment is different, we need to evaluate the relative fit of
the Predicted TRE (PTRE) vs. the Experimental TRE (ETRE) to quantitatively
measure the quality of the uncertainty prediction. Due to the significant
anisotropy, we did not use the basic ratio $ETRE^2/PTRE^2$, but rather the
validation index [14], which weights the observed error vector with the
inverse of its predicted covariance matrix to yield a Mahalanobis distance
$\mu^2$. Assuming a Gaussian error on test points after registration, this
validation index should follow a $\chi^2_3$ law. Repeating this experiment
with many different "parameter" configurations, we can verify that $\mu_i^2$
is actually $\chi^2_3$ distributed using the Kolmogorov-Smirnov (K-S) test
[16]. We also verify that the empirical mean and variance match the
theoretical ones (resp. 3 and 6 for a $\chi^2_3$ distribution).
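
The behaviour expected from the validation index can be checked on simulated
errors. In the following Python sketch, the covariance matrix, the sample
size and the function name `validation_index` are assumptions made for the
example; errors are drawn from a Gaussian whose covariance is exactly the
"predicted" one, so the index should have a mean close to 3 and a variance
close to 6.

```python
import numpy as np

def validation_index(err, Sigma):
    """Mahalanobis distance mu^2 = err^T . Sigma^-1 . err, one value per row of err."""
    sol = np.linalg.solve(Sigma, err.T).T
    return np.sum(err * sol, axis=1)

# Hypothetical anisotropic predicted covariance (mm^2) and simulated error vectors.
rng = np.random.default_rng(0)
Sigma = np.diag([4.0, 1.0, 0.25])
errors = rng.multivariate_normal(np.zeros(3), Sigma, size=20000)

mu2 = validation_index(errors, Sigma)
print(mu2.mean(), mu2.var())
```

If the predicted covariance were too optimistic (too small), the mean of the
index would rise above 3; if it were too pessimistic, the mean would fall
below 3. This is the reading used in the experiments that follow.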

Table 1 summarizes the statistics obtained for 20000 registrations where all 
the parameters randomly vary as previously described. The values obtained for 
both the validation index and the KS-test fully validate the reliability of the 
transformation’s accuracy prediction. 



Real Calibration and Synthetic Noise. The perfect validation of our accuracy
prediction on synthetic data does not take into account possible calibration
errors of the cameras, nor likely distortions w.r.t. the pinhole model. The
goal of this experiment is to assess the validity of these assumptions using
a real video system. We used a 54-point calibration grid that allows for a
very accurate detection of the points ($\sigma_{3D} < 0.1$ mm,
$\sigma_{2D} < 0.2$ pixel). Such noise levels are obviously far below those
currently obtained when detecting real marker positions
($\sigma_{2D} \simeq 2$ pixel, $\sigma_{3D} \simeq 1$ mm). To simulate the
range of variability of our application, we still add Gaussian noise to the
collected data points.

Ideally, the Experimental TRE should be assessed by comparing each
registration result with a gold standard that relates both the CT and the
camera coordinate systems to the same physical space, using an external and
highly accurate apparatus. As such a system is not available, we adapted the
registration loops protocol introduced in [13,17], which makes it possible to
measure the TRE for a given set of test points.

The principle is to acquire several pairs of 2D images with jointly
calibrated cameras so that we can compare independent 3D/2D registrations of
the same object (different 2D and 3D images) using a statistical Mahalanobis
distance $\mu^2$.

$^1$ $SNR_{dB} = 10 \log_{10}(\sigma_s / \sigma_n)$ where $\sigma_s$ (resp.
$\sigma_n$) is the variance of the signal (resp. noise).

Fig. 1. Registration loops used to estimate the registration consistency: a
test point $C$ chosen at a certain distance from the printed grid (typically
20 cm) is transformed into the CAM1 coordinate system using a first 3D/2D
registration $T_1$, then back into the grid coordinate system using a second
3D/2D registration $T_2$ provided by the other pair of cameras (the
coordinate systems of CAM1 and CAM2 are identical since the cameras are
jointly calibrated). If all transformations were exact, we would obtain the
same position for the test point. Of course, since the transformations are
not perfect, we measure an error whose variance
$\sigma^2_{loop} = 2\sigma^2_{cam/grid}$ corresponds to a TRE. In fact, to
take into account anisotropies, we compute a covariance matrix and a
statistical Mahalanobis distance $\mu^2$ between $C$ and
$T_1 \star T_2^{-1} \star C$.

A typical loop, sketched in Fig. 1, describes the method to obtain a
$\mu^2$-value. Since this experiment provides only one error measurement, we
still need to repeat it with different datasets to obtain statistically
significant measures. In order to
take into account possible calibration error and/or bias, it is necessary to change 
the cameras calibrations and positions, and not only to move the object in the 
physical space. Likewise, to decorrelate the two 3D/2D transformations, we need 
to use two differently noised 3D data sets. Indeed, when using the same set of 
3D points to register the 2D points, the error on 3D points similarly affects 
both transformations, and the variability of the 3D points extraction (and any 
possible bias) is hidden. 
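
The computation performed in one registration loop can be sketched in Python
as follows. This is an illustration under assumptions: transformations are
stored as 4x4 homogeneous matrices mapping grid coordinates to camera
coordinates, the composition order follows the prose description of Fig. 1
(forward with the first registration, back with the inverse of the second),
and the covariance of the loop error is taken as the sum of the two predicted
target covariances, which is consistent with
$\sigma^2_{loop} = 2\sigma^2_{cam/grid}$ when both are equal. All names and
values are hypothetical.

```python
import numpy as np

def rigid(R, t):
    """4x4 homogeneous matrix of the rigid transformation x -> R x + t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def loop_error(T1, T2, C):
    """Send C to the camera frame with T1, bring it back with T2^-1, return the offset."""
    C_h = np.append(C, 1.0)
    C_loop = np.linalg.inv(T2) @ (T1 @ C_h)
    return C_loop[:3] - C

def loop_mu2(T1, T2, C, Sigma1, Sigma2):
    """Mahalanobis distance of the loop error, assuming independent registrations."""
    d = loop_error(T1, T2, C)
    return float(d @ np.linalg.solve(Sigma1 + Sigma2, d))

# Hypothetical example: the two registrations differ by a 1 mm translation along x.
C = np.array([0.0, 0.0, 200.0])                 # test point 20 cm from the grid (mm)
T1 = rigid(np.eye(3), np.array([1.0, 0.0, 0.0]))
T2 = rigid(np.eye(3), np.zeros(3))
Sigma = 0.5 * np.eye(3)                         # predicted target covariance (mm^2)

print(loop_error(T1, T2, C))
print(loop_mu2(T1, T2, C, Sigma, Sigma))
```

Each loop yields a single $\mu^2$-value, which is why the experiment has to
be repeated over many camera configurations, grid positions and noisy 3D
data sets before its distribution can be compared with the theoretical one.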

Finally, varying each set of parameters (different configurations of our
four cameras, different positions/orientations of the calibration grid), we
obtained 144 $\mu^2$-values. The cameras were placed 10° to 60° apart, at a
distance from the object of 25 to 30 times the focal length. Figure 2 shows
the mean, standard deviation and K-S test value of the validation index
w.r.t. the number of points used










[Fig. 2 plots: SPPC and EPPC panels; horizontal axis: number of points, 0 to 50.]




Fig. 2. Validation of the uncertainty prediction on the calibration grid w.r.t. the num- 
ber of points used for the registration. Top: mean and standard deviation of the vali- 
dation index. Bottom: KS confidence. Higher scores are more confident. 



(randomly chosen among the 54 available). One can see that the prediction is
correct up to 40 points (which spans our range of application). This critical
value is due to the progressive reduction of the registration error, which
finally reaches the level of the ignored calibration error (about 0.5 mm).
Likewise, we observed the same behavior when the feature noise becomes too
small ($\sigma_{2D}$ and $\sigma_{3D}$ below 0.7).



Real Data (Phantom). To test the last assumption of our prediction (unbiased
Gaussian noise), we now turn to a validation experiment on real 3D and 2D
measurements of a plastic phantom (designed in [12]), onto which about 40
radio-opaque markers are stuck (see the top left inset image in Fig. 4). The
setup is almost the same as for the previous calibration grid (for further
details see [12]). However, target points $C_i$ are now randomly chosen
within the phantom liver, and markers in the CT and on the video images are
interactively localized.

The markers used were randomly chosen among the 40 available, and we
obtained 80 $\mu^2$-values for each experiment. As we experimentally observed
that there was a consistent but non-rigid motion of the soft skin (about
1 mm), we chose $\sigma_{3D} = 2.0$ mm (instead of 1 mm) to take into account
this additional uncertainty. Figure 3 presents the mean and variance of
$\mu^2$ w.r.t. the number of points. Firstly, we notice that the mean value
slowly increases with the number of points. This can be explained by the
biases introduced by the calibration error and the correlated motion of the
markers on the skin. Indeed, the measured accuracy figures do not converge to
0 mm with a large number of points but rather towards 1 mm, which corresponds
to the motion of the skin.

Nevertheless, it appears that the prediction is well validated for a range
of 15 to 25 points. As $\mu^2$ can be interpreted as a relative error of the
error prediction (see [14]), Fig. 3 shows that we over-estimate the mean TRE
by a factor of 1.7 for a small number of points ($\mu^2 \simeq 1$), and that
we under-estimate it by a factor of 1.3 for more than 25 points
($\mu^2 \simeq 5$). For our application, in which the number of visible
points should not exceed 20, this means that we correctly predict the
amplitude of the error on the transformation. In the worst case, we
over-estimate it, which can be considered a good safety measure. One can
visually assess the validation of our prediction error on one case among the
160 registrations we performed (Fig. 4).

Fig. 3. Validation of the uncertainty prediction on the phantom w.r.t. the
number of points used for the registration. Top: mean and standard deviation
of the validation index. Bottom: KS confidence. Higher scores are more
confident.

Fig. 4. Left image: the top image shows the phantom with radio-opaque
markers on its skin. The main image shows the phantom without its skin and we
can see the radio-opaque markers on the fake liver. Right image: we
superimpose the reconstructions of the fiducials whose predicted accuracy is
around 2 mm. One can visually assess the quality of the registration.


4 Conclusion 

We devised in this paper an augmented reality system for Radio-Frequency
ablation guidance based on a new 3D/2D registration criterion with a
validated error prediction. We argue that it is necessary to provide not only
the best achievable registration accuracy but also an accurate assessment of
the TRE for safety reasons.

To reach the best accuracy, we first derived a new 3D/2D Maximum Likelihood
registration criterion (EPPC) based on statistical hypotheses that are better
adapted than those of the classical 3D/2D least-squares registration
criterion (SPPC). Experiments on real data showed that EPPC provides an
accuracy of about 2 mm within the liver, which fits the initial requirement
of less than 5 mm. Moreover, EPPC is up to 9% more accurate than SPPC with a
refresh rate that remains compatible with real-time constraints. We underline
an alternative interpretation of this gain: we can typically reach the same
accuracy with 20 markers for EPPC where 24 are needed for SPPC. As we face
possible marker occlusions caused by the surgeon's hand, as well as
cumbersome constraints on the placement of the markers, this gain should not
be taken lightly. In addition, as clinical conditions do not allow free
camera positioning, we may encounter situations where the angle between the
cameras decreases below 20°, which would mean an accuracy gain of 18%.

In order to assess the system accuracy for all configurations, we propose in
a second step a theoretical propagation of the target covariance through SPPC
and EPPC w.r.t. the experimental configuration parameters. To verify the
validity of all the assumptions of that method, we conducted a careful
validation study that assesses in turn the range of validity of each
hypothesis. We first verified that non-linearities in the criterion and
calibration errors are negligible. Then, we used a realistic phantom with a
soft and deformable skin to validate the prediction in the range of our
application (i.e. for 15 to 25 markers). This study confirmed that we
correctly predict the registration error, with a slight over-estimation if
too many markers are occluded, which is a good safety rule.

To reach clinical usability, the whole system still has to be validated on
real patients. We are currently conducting experiments (using repeated CT
scans at the same point of the breathing cycle) to certify that the motion of
the internal structures due to the monitored breathing of the patient cannot
bias our accuracy prediction. Preliminary results indicate that this motion
is of the order of 1 mm, which is in accordance with the motions we
experienced because of the phantom's soft skin. Thus, we are confident that
our registration error prediction will work properly in the final system.
Last but not least, it is possible to estimate the TRE broadly before
scanning the patient, by using the stereoscopic reconstruction of the markers
instead of their positions in the scanner. This will allow a better control
of the external conditions (number of markers, angle between the cameras) and
the optimization of the intervention preparation.

Abstract. There has been much effort invested in increasing the
robustness of human body tracking by incorporating motion models. Most
approaches are probabilistic in nature and seek to avoid becoming
trapped in local minima by considering multiple hypotheses, which
typically requires exponentially large amounts of computation as the
number of degrees of freedom increases.

By contrast, in this paper, we use temporal motion models based on Prin- 
cipal Component Analysis to formulate the tracking problem as one of 
minimizing differentiable objective functions. The differential structure 
of these functions is rich enough to yield good convergence properties 
using a deterministic optimization scheme at a much reduced compu- 
tational cost. Furthermore, by using a multi-activity database, we can 
partially overcome one of the major limitations of approaches that rely 
on motion models, namely the fact they are limited to one single type of 
motion. 

We will demonstrate the effectiveness of the proposed approach by using 
it to fit full-body models to stereo data of people walking and running 
and whose quality is too low to yield satisfactory results without motion 
models. 



1 Introduction 

In recent years, much work has been devoted to increasing the robustness of 
people tracking algorithms by introducing motion models. Most approaches rely 
on probabilistic methods, such as the popular CONDENSATION algorithm [1, 
2], to perform the tracking. While effective, such probabilistic approaches require 
exponentially large amounts of computation as the number of degrees of freedom 
in the model increases, and can easily become trapped into local minima unless 
great care is taken to avoid them [3, 4, 5, 6]. 

By contrast, in this paper, we use temporal motion models based on Principal
Component Analysis (PCA) and inspired by those proposed in [7,8,9] to
formulate the tracking problem as one of minimizing differentiable objective
functions. Our experiments show that the differential structure of these
objective functions is rich enough to take advantage of standard
deterministic optimization methods [10], whose computational requirements are
much smaller than those of probabilistic ones and which can nevertheless
yield very good results even in difficult situations. Furthermore, in
practice, we could combine both kinds of approaches [5].

We will further argue that we can partially overcome one of the major
limitations of approaches that rely on motion models, namely that they limit
the algorithms to the particular class of motion from which the models have
been created. This is achieved by performing PCA on motion databases that
contain multiple classes of motions as opposed to a single one, which yields
a decomposition in which the first few components can be used to classify the
motion and can evolve during tracking to model the transition from one kind
of motion to another.

We will demonstrate the effectiveness of the proposed approach by using it 
to fit full-body models to stereo data of people walking and running and whose 
quality is too low to yield satisfactory results without models. This stereo data 
simply provides us with a convenient way to show that this approach performs 
well on real data. However, any motion tracking algorithm that relies on minimiz- 
ing an objective function is amenable to the treatment we propose. We therefore 
view the contribution of this paper as the proposed formulation that produces 
results using a deterministic, as opposed to probabilistic optimization method, 
which yields good performance at a reduced computational cost. 

In the remainder of this paper, we first discuss related approaches and our 
approach to body and motion modeling. We then introduce our deterministic 
optimization scheme and show its effectiveness using real data. 

2 Related Work 

Modeling the human body and its motion is attracting enormous interest in the 
Computer Vision community, as attested by recent and lengthy surveys [11,12]. 
However, existing techniques remain fairly brittle for many reasons: Humans 
have a complex articulated geometry overlaid with deformable tissues, skin and 
loosely-attached clothing. They move constantly, and their motion is often rapid, 
complex and self-occluding. Furthermore, the 3-D body pose is only partially 
recoverable from its projection in one single image. Reliable 3-D motion analysis 
therefore requires reliable tracking across frames, which is difficult because of the 
poor quality of image-data and frequent occlusions. 

When a person is known a priori to be performing a given activity, such 
as walking or running, an effective means to constrain the search and increase 
robustness is to introduce a motion model. Of particular interest to us are models 
that represent motion vectors as linear sums of principal components and have 
become widely accepted in the Computer Animation community as providing 
realistic results [13,14,15]. The PCA components are computed by capturing as
many people as possible performing a specific activity, for example by means 
of an optical motion capture system, representing each motion as a temporally 
quantized vector of joint angles, and performing a Principal Component Analysis 
on the resulting set of vectors. 
In practice, the position of a person, or body pose, in a given image frame
can be defined by the position and orientation of a root node and a vector of
joint angles. A motion can then be represented by an angular motion vector,
that is, a set of such joint angle vectors measured at regularly sampled intervals.
Given a large enough database of motion vectors for different motion classes and
the corresponding principal components $\Theta_i$, $1 \le i \le m$, at a given time $t$, the
joint angle vector $\Theta(\mu_t)$ can then be written as

$$\Theta(\mu_t) = \Theta_0(\mu_t) + \sum_{i=1}^{m} \alpha_i \Theta_i(\mu_t) \quad \text{with } 0 \le \mu_t \le 1 , \qquad (1)$$

where $\mu_t$ is a normalized temporal variable that indicates to what stage of the
motion the pose corresponds, $\Theta_0$ represents an average motion, and the $\alpha_i$ are
scalar coefficients. In short, the vector $(\mu_t, \alpha_1, \ldots, \alpha_m)$, where $m$ is much smaller
than the number of joint angles, can be used as the state vector that completely
describes the body pose. Recovering this pose then amounts to minimizing an
image-based objective function $F$ with respect to this more compact representation,
and can be expected to be much more robust than minimizing it with
respect to the full set of joint angles.
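
As a rough illustration of Eq. 1, the following Python sketch evaluates a pose from a PCA motion model. It assumes, hypothetically, that the average motion and the eigenvectors are stored as arrays sampled uniformly over one normalized cycle, and that poses between samples are obtained by linear interpolation. The names `pose_at`, `theta0`, `eigvecs`, and `alphas` are illustrative and do not come from the paper.

```python
import numpy as np

def pose_at(mu, theta0, eigvecs, alphas):
    """Evaluate Theta(mu) = Theta_0(mu) + sum_i alpha_i * Theta_i(mu).

    mu      : normalized time in [0, 1]
    theta0  : (n_samples, n_dof) average motion sampled over one cycle
    eigvecs : (m, n_samples, n_dof) principal components
    alphas  : (m,) scalar coefficients
    """
    n_samples = theta0.shape[0]
    # Weighted sum of the eigenvectors added to the average motion.
    motion = theta0 + np.tensordot(alphas, eigvecs, axes=1)
    # Assumed storage scheme: uniform samples, linear interpolation in between.
    pos = mu * (n_samples - 1)
    lo = int(np.floor(pos))
    hi = min(lo + 1, n_samples - 1)
    w = pos - lo
    return (1.0 - w) * motion[lo] + w * motion[hi]
```

With this layout the state of one frame is just the scalar `mu` and the short vector `alphas`, which is what makes the search space small compared with the full set of joint angles.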

This representation has already been successfully used in our community,
but almost always in a statistical context [7,8,9] and without exploiting the
fact that $F$ is easily differentiable with respect to $\mu_t$ and the $\alpha_i$ coefficients of
Eq. 1. Here, we propose to use this fact to formulate the fitting problem as a
traditional optimization problem with respect to the $(\mu_t, \alpha_1, \ldots, \alpha_m)$ state vector.
Instead of generating many “particles” by randomly choosing values for the $\alpha_i$,
we will compute the Jacobian of $F$ and use it in conjunction with standard
least-squares techniques [16]. Our deterministic approach to motion tracking is
therefore related to an earlier technique [17] that also uses PCA to model the
set of 2-D flow vectors that can be seen in video-sequences of a walking subject 
and to recognize specific 2-D poses without requiring a probabilistic framework. 
However, this approach relies on an initial segmentation of the body parts and is 
viewpoint dependent. By contrast, we fit a global 3-D model to the whole body, 
which lets us fit over a whole sequence and recover accurate 3-D poses. 

3 Models 

In this section, we introduce the models we use to describe both body pose and 
shape at a given time as well as its motion over time. 

3.1 Body Model 

In earlier work [18], we have developed a body-modeling framework that relies on
attaching implicit surfaces, also known as soft objects, to an articulated skeleton.
Each primitive defines a field function and the skin is taken to be a level set of the
sum of these fields, as shown in Fig. 1a. Defining surfaces in this manner lets us
define a distance function of data points to the model that is differentiable. We
will take advantage of this to implement our minimization scheme, as discussed
in Section 4.

Fig. 1. Shape and motion models. a) Volumetric primitives attached to an articulated
skeleton. b) First two PCA components for 4 different captures of 4 subjects walking
at speeds varying from 3 to 7 km/h, and running at speeds ranging from 6 to 12 km/h.
The data corresponding to different subjects is shown in different styles. c) Percentage
of the database that can be generated with a given number of eigenvectors.

As in Section 2, let us assume that, at a given time, the pose of the skeleton
is entirely characterized by the global position and orientation $G$ of a root node
and a set of joint angles $\Theta$. To avoid undue blending of primitives, the body is
divided into several body parts. Each body part $b$ includes $n^b$ ellipsoidal primitives
attached to the skeleton. To each primitive is associated a field function
$f_i$ of the form $f_i(G, \Theta, X) = b_i \exp(-a_i d_i(G, \Theta, X))$, where $X$ is a 3-D point,
$a_i$, $b_i$ are constant values, and $d_i$ is the algebraic distance to this ellipsoid. The
complete field function for body part $b$ is taken to be

$$f^b(G, \Theta, X) = \sum_{i=1}^{n^b} f_i(G, \Theta, X) , \qquad (2)$$

and the skin is the set $S(G, \Theta) = \bigcup_b \{X \in \mathbb{R}^3 \mid f^b(G, \Theta, X) = C\}$, where $C$ is a
constant. A point $X$ is said to be attached to body part $b$ if

$$|f^b(G, \Theta, X) - C| = \min_{1 \le i \le B} |f^i(G, \Theta, X) - C| . \qquad (3)$$

Fitting the model to stereo data acquired at time $t$ then amounts to minimizing

$$F_t(G_t, \Theta_t) = \sum_{b=1}^{B} \sum_{X_t \in b} \left( f^b(G_t, \Theta_t, X_t) - C \right)^2 , \qquad (4)$$

where the $X_t$ are the 3-D points derived from the data, each one being attached
to one of the $B$ body parts. Note that $F_t$ is a differentiable function of the global
position $G_t$ and of the joint angles in $\Theta_t$ and that its derivatives can be computed
fast [18].
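
The attachment rule and the objective of Eqs. 3 and 4 can be sketched as follows. This is only an illustration under simplifying assumptions that are not in the paper: the ellipsoids are axis-aligned and already placed in world coordinates (so the dependence on $G$ and $\Theta$ is omitted), the algebraic distance is taken to be the normalized quadratic form, and the constants `a` and `b` are shared by all primitives. All function and variable names are hypothetical.

```python
import numpy as np

def part_field(points, centers, radii, a=1.0, b=1.0):
    """Field of one body part: sum over its ellipsoids of b * exp(-a * d_i(X)).

    points  : (n_points, 3) data points
    centers : (n_prims, 3) ellipsoid centers (assumed already posed)
    radii   : (n_prims, 3) semi-axes of the axis-aligned ellipsoids
    """
    diff = (points[:, None, :] - centers[None, :, :]) / radii[None, :, :]
    d = np.sum(diff ** 2, axis=2)          # assumed algebraic distance
    return np.sum(b * np.exp(-a * d), axis=1)

def attach_and_score(points, parts, C):
    """Attach each point to the part whose field is closest to the level C,
    then sum the squared residuals of the attached parts."""
    fields = np.stack([part_field(points, c, r) for c, r in parts], axis=1)
    residuals = np.abs(fields - C)         # (n_points, n_parts)
    labels = np.argmin(residuals, axis=1)
    F = np.sum(residuals[np.arange(len(points)), labels] ** 2)
    return labels, F
```

Because every term is a smooth function of the primitive positions, the score stays differentiable once the attachments are fixed, which is the property the minimization relies on.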
3.2 Motion Models



To create a motion database, we used a Vicon™ optical motion capture system
and a treadmill to capture 4 people, 2 men and 2 women, 

— walking at 9 different speeds ranging from 3 to 7 km/h, by increments of 0.5 
km/h; 

— running at 7 different speeds ranging from 6 to 12 km/h, by increments of 
1.0 km/h. 



The data was then segmented into cycles and normalized so that each one is
represented by the same number of samples. To this end, spherical interpolation
in quaternion space was used because it is the space in which a distance measuring
the proximity of two orientations can be naturally defined. It therefore
lets us interpolate with a meaningful angular velocity measure on an optimal
path splining among orientation key frames [19]. Since people never perform the
same motion twice in exactly the same fashion, we included in the database four
walking or running cycles for each person and speed. The mean motion of the set
of examples was subtracted and the $M$ eigenvectors of Eq. 1 were obtained by
SVD. Retaining only $m < M$ eigenvectors gave us a reduced basis of the most
significant subspace of the motion space, that is, the one that contains a fraction
$\sigma$ of the database,

$$\sigma = \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{M} \lambda_i} , \qquad (5)$$

where $\lambda_i$ is the $i$-th largest eigenvalue.

In our experiments, we chose $\sigma = 0.9$, which means that for the multi-activity
database we need only 5 out of 256 coefficients, 256 being the total
number of examples in the database. In Fig. 1c we display $\sigma$ as a function of the
number of eigenvectors. The same method was used for the walking and running
databases independently. The estimation problem is thus reduced from the ~80
degrees of freedom for the 28 joints in our body model at each time step, to 5
coefficients plus the time.
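
The construction of the reduced basis can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes each normalized motion cycle has been flattened into one row of a matrix, and the names `motion_pca` and `motions` are hypothetical.

```python
import numpy as np

def motion_pca(motions, sigma=0.9):
    """Return the mean motion, the retained eigenvectors and the cumulative
    eigenvalue ratio for a database of motion vectors.

    motions : (n_examples, n_features) array, one motion vector per row
    sigma   : fraction of the database that the retained basis must cover
    """
    mean = motions.mean(axis=0)
    centered = motions - mean
    # Rows of vt are the eigenvectors of the covariance matrix; the squared
    # singular values are proportional to its eigenvalues.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    eigvals = s ** 2
    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    # Smallest m whose cumulative ratio reaches sigma.
    m = min(int(np.searchsorted(ratio, sigma)) + 1, len(s))
    return mean, vt[:m], ratio
```

Plotting `ratio` against the number of eigenvectors gives the kind of curve described for Fig. 1c.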

Fig. 1b shows the first two PCA components of the original examples used
to create the joint walking and running database. The two activities produce 
separate clusters. The walking components appear on the left of the plot and 
form a relatively dense set. By contrast, running components are sparser because 
inter-subject variation is larger, indicating that more examples are required for 
a complete database. 

Note that varying only the first two components along the curve correspond- 
ing to the path from one subset to another, yields very natural transitions be- 
tween walking and running motions. 



4 Deterministic Approach to Tracking 

In this section we introduce our deterministic approach to tracking that relies 
on describing motion as a linear combination of the motion eigenvectors of Sec- 
tion 3.2 and choosing optimal weights for these vectors. As before, we represent 
the angular component of motion $\Theta$ as $\Theta = \Theta_0 + \sum_{i=1}^{m} \alpha_i \Theta_i$, where $\Theta_0$ is the
average motion and the $\Theta_i$ are the eigenvectors of Section 3.2. Evaluating $\Theta$ at
a particular time $\mu_t$ yields the pose

$$\Theta(\mu_t) = \Theta_0(\mu_t) + \sum_{i=1}^{m} \alpha_i \Theta_i(\mu_t) = [\theta^1(\mu_t), \ldots, \theta^{ndof}(\mu_t)]^T , \qquad (6)$$

where the $\theta^j$ are the actual joint angles at time $\mu_t$ for the $ndof$ degrees of freedom
of the body model we use.

Note that the complete motion is described not only by the angular motion 
discussed above, but also by the motion $G_t$ of the root body model node with
respect to which the angles are expressed. This adds six degrees of freedom to 
our model, which are not represented at all in our motion database since the data 
was acquired on a treadmill on which the subjects were forced to walk straight. 
Furthermore, even if the global motion had been acquired, it would make no 
sense to include it in the database because similar motions would then have been 
considered as different just because of the orientation or position of the body. 

Let us assume that we have acquired image data, which here is the stereo
data depicted by Fig. 2, but could just as well be anything else, in $T$ consecutive
frames. Our goal is to recover the motion by minimizing an objective function
$F$ over all frames and, therefore, fitting the model to the image data. Tracking
is achieved in two main steps. First the global motion $G_t$ is recovered in a
recursive way. Results from frame $t$ are used as initialization for frame $t + 1$. We
initialize using the average motion $\Theta_0$, positioning the global motion for the first
frame by hand, where the time $\mu_t$, $1 \le t \le T$, is a linear interpolation between
initial values $\mu_1$ and $\mu_T$ for the first and last frames. For each frame we minimize
$F_t(G_t, \Theta_0(\mu_t))$ with respect to $G = G(t_x, t_y, t_z, \theta_x, \theta_y, \theta_z)$, where $F_t$ is defined
in Eq. 4. Given this global motion estimate, we then fit the data over all
frames simultaneously by minimizing $F$ with respect to the $\mu_t$, $\alpha_i$ and $G_t$:

$$F = \sum_{1 \le t \le T} F_t(G_t, \Theta(\mu_t, \alpha_i)) . \qquad (7)$$

Fig. 2. Input stereo data. Top row: First image of a synchronized trinocular video
sequence at three different times. The 3-D points computed by the Digiclops™ system
are reprojected onto the images. Bottom row: Side views of these 3-D points. Note that
they are very noisy and lack depth because of the low quality of the video sequence.

In the remainder of this Section, for comparison purposes, we first show 
the result of fitting the stereo data we use without using motion models. We 
then introduce in more detail our approach to enforcing the motion models, 
with or without assuming that the style remains constant. Finally, we discuss 
the computational requirements of our scheme and contrast them with those of 
more traditional probabilistic approaches. 

4.1 Tracking without Motion Models 

In this paper we use stereo data acquired using a Digiclops™ operating at a
640 × 480 resolution and a 14 Hz framerate, which is relatively slow when it
comes to capturing a running motion. The quality of the data is poor for several 
reasons. First, to avoid motion blur, we had to use a high shutter speed that 
reduces exposure too much. Second, because the camera is fixed and the subject 
must remain within the capture volume, she appears to be very small at the 
beginning of the sequence. As a result the data of Fig. 2 is very noisy and lacks 
both resolution and depth. 




Fig. 3. Tracking without a motion model. Given the low framerate, motion between 
frames is large enough to provoke erroneous attachments of data points to body parts 
and, as a consequence, very poor fitting behavior. The whole sequence is shown. 
To establish a baseline, in Fig. 3, we show the unsatisfactory result of fitting
our model to this data without using motion models, that is by minimizing the 
objective function of Eq. 4 in each frame separately. We simply use the recovered 
pose in frame t — 1 as the starting point in frame t. 

Careful analysis shows that tracking fails chiefly due to the low framerate, as 
the interframe motion is too large. This prevents the process of “attaching” data 
points to body parts discussed in Section 3.1 from functioning properly. In the 
fifth image of Fig. 3, both legs end up being “attracted” to the same data points. 



4.2 Tracking a Steady Motion 

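To remedy the problems discussed above, we can first assume that the motion
is steady over $T$ data frames and, therefore, that the $\alpha_i$ coefficients of Eq. 6 are
invariant. The motion state vector is taken to be

$$\phi = [\mu_t, \alpha_i] = [\mu_1, \ldots, \mu_T, \alpha_1, \ldots, \alpha_m] . \qquad (8)$$

To effectively minimize the objective function $F$ of Eq. 7 using a standard
least-squares technique [16], we need to evaluate its Jacobian. Bearing in mind that
the derivatives of $F$ with respect to the individual joint angles can be easily
computed [18], this can be readily done as follows:

$$\frac{\partial F}{\partial \alpha_i} = \sum_{j=1}^{ndof} \frac{\partial \theta^j}{\partial \alpha_i} \frac{\partial F}{\partial \theta^j} , \qquad \frac{\partial F}{\partial \mu_t} = \sum_{j=1}^{ndof} \frac{\partial \theta^j}{\partial \mu_t} \frac{\partial F}{\partial \theta^j} . \qquad (9)$$

Because the $\theta^j$ are linear combinations of the $\Theta_i$ eigenvectors, $\partial \theta^j / \partial \alpha_i$ is simply
$\theta_i^j$, the $j$-th coordinate of $\Theta_i$. Similarly, we can write

$$\frac{\partial \theta^j}{\partial \mu_t} = \sum_{i=0}^{m} \alpha_i \frac{\partial \theta_i^j}{\partial \mu_t} , \quad \text{with } \alpha_0 = 1 \text{ for the average motion } \Theta_0 ,$$

where the $\partial \theta_i^j / \partial \mu_t$ can be evaluated using finite differences and stored when building
the motion database, as shown in Fig. 4.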
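
The chain rule above amounts to two small matrix products per frame. The sketch below is a hedged illustration, not the authors' code: it assumes the derivatives of $F$ with respect to the joint angles are already available, and the function and argument names are hypothetical.

```python
import numpy as np

def grad_wrt_state(dF_dtheta, eigvecs_at_mu, dtheta_dmu):
    """Chain rule for a single frame.

    dF_dtheta     : (n_dof,) derivatives of F w.r.t. the joint angles
    eigvecs_at_mu : (m, n_dof) eigenvectors Theta_i evaluated at mu_t
    dtheta_dmu    : (n_dof,) temporal derivative of the pose at mu_t,
                    including the contribution of the average motion
    Returns dF/dalpha as an (m,) array and dF/dmu as a float.
    """
    # d theta^j / d alpha_i is the j-th coordinate of Theta_i.
    dF_dalpha = eigvecs_at_mu @ dF_dtheta
    dF_dmu = float(dtheta_dmu @ dF_dtheta)
    return dF_dalpha, dF_dmu
```

In a steady-motion setting the per-frame `dF_dalpha` vectors would be summed over frames, since the same coefficients are shared by the whole sequence.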




Fig. 4. Motion vector and its temporal derivatives. Left: First 5 eigenvectors for the
flexion-extension in the sagittal plane of the left knee. Right: Temporal derivatives
$\partial \theta_i^j / \partial \mu_t$.

Fig. 5. Using low resolution stereo data to track a woman whose motion was recorded in
the database. The recovered skeleton poses are overlaid in white. The legs are correctly 
positioned. 




Fig. 6. Tracking a walking motion assuming a constant style. The legs are correctly 
positioned. 



Figure 5 depicts results on a walking sequence performed by a subject whose 
motion was captured when building the database. Note that the legs are correctly 
positioned. The errors in the upper body are due to the noisiness of the stereo
cloud. 

Fig. 6 displays the results on a walking sequence performed by a subject who
was not recorded when building the database. To validate our results, he is wearing
four gyroscopes on his legs, one for each sagittal rotation of the hip and knee
joints. The angular speeds they measure are used solely for comparison purposes
and we show their integrated values in Fig. 7. We overlay on the corresponding
plots the values recovered by our tracker. Note that they are very close, even
though the left leg is severely occluded.

Fig. 7. Comparing recovered rotation angles using visual tracking (solid curve), and by
integrating gyroscopic data (smooth curve) for the walk of Fig. 6. Left column: Right
hip and knee sagittal rotations. Right column: Same thing for the left leg. Note that
both curves are very close in all plots, even though the left leg is severely occluded.

Fig. 8. Tracking a running motion assuming a constant style. The legs are correctly
positioned except the left one in the first frame.

Fig. 8 depicts results on the running sequence of Fig. 2 using the running 
database of Section 3.2, which are much better than those of Fig. 3. The pose 
of the legs is now correctly recovered, except the one of the left leg in the first 
frame. This is due in part to the fact that the database was acquired using a 
treadmill and is therefore too sparse to model a motion in which the leg is raised 
that high, and in part to the fact that the motion is not truly steady. We address 
these issues below. 
Fig. 9. Tracking a running motion while allowing the style to vary. The legs are now
correctly positioned in the whole sequence. 



4.3 Tracking a Variable Style and Speed Motion 

In the sequences shown in this paper, speed and style are not truly constant. 
Because of space constraints, the subject starts, accelerates and stops over a short 
distance. This is true for walking and running, and even more so for transitions 
from one to the other. Using a single set of $\alpha_i$ parameters for the whole sequence
as in Section 4.2 therefore overconstrains the problem. We relax these constraints
by introducing a set of $\alpha_i$ per frame or per set of frames and the state vector
then becomes:

$$\phi = [\mu_t, \alpha^i] \quad \text{where } \alpha^i = (\alpha^i_1, \ldots, \alpha^i_m) . \qquad (10)$$

Improved tracking results from the running sequence of Fig. 2 are shown in 
Fig. 9. The system now has enough freedom to raise the leg in the first frame 
while still positioning the legs correctly everywhere else. Upper body tracking 
remains relatively imprecise because average errors in the stereo data are larger 
than the distance between torso and arms. Improving this would require the use 
of additional information, such as silhouette information, which could easily be 
done within the proposed framework. Similar results for walking are shown in 
Fig. 10. Small errors in foot positioning are due to the fact that ankle flexion 
has not been recorded in the motion database. 

Having a set of PCA parameters per frame gives the system the freedom
to automatically evolve from one activity to another. To demonstrate this, in
Fig. 11, we use our full motion database to track a transition from walking
to running. In the first few frames the subject is walking, then for a couple of
frames she performs the transition and runs for the rest of the sequence. The
arms are not tracked because we focus on estimating the motion parameters
of the lower body only. Here again, the legs are successfully tracked with small
errors in foot positioning that are due to the fact that ankle flexion is not part
of the motion database.

Fig. 10. Tracking a walking motion while allowing the style to vary. The sequence has
a total of 18 frames; we show every other frame.

4.4 Computational Requirements 

Probabilistic approaches such as the one of [8] rely on randomly generating “particles”
and evaluating their fitness. Assuming the cost of creating the particles
to be negligible, the main cost of each iteration comes from evaluating an objective
function, such as the function $F$ of Eq. 7, for each particle. In the classical
implementation of condensation, where the state vector has $ndof$ degrees
of freedom, the cost is therefore of the order of $O(npart(ndof))$ times the cost
of computing $F$, where $npart$ is the number of particles, which tends to grow
very fast if good convergence properties are to be achieved. On the other hand,
if we use our motion models to perform the condensation, the cost is of the order
of $O(npart(n))$, which also grows fast with $n$, the dimension of the state vector.

By contrast, the main cost of each iteration of our optimization scheme comes
from evaluating $F$ and its Jacobian, which is of course more expensive than
evaluating $F$ alone. However, through careful implementation, we have found
that it can be done at a cost of the order of $O(\log(ndof))$ times the cost of
computing $F$ alone, since evaluating $F$ and its derivatives for the $ndof$ degrees of
freedom in the body model involves many similar computations, and computing
$\partial F / \partial \theta^j$ once per iteration is what is costly. It took fewer than 15 iterations to achieve
convergence. As a consequence, the costs of the methods of Sections 4.2 and 4.3
are of the same order and smaller than that of the probabilistic approach.

Fig. 11. Tracking the transition between walking and running. In the first four frames
the subject is walking. The transition occurs in the following three frames and the
sequence ends with running. The whole sequence is shown.

5 Conclusion 



We have presented an approach using motion models that allows us to formulate
the tracking problem as one of minimizing a differentiable objective function with
respect to relatively few parameters. We take them to be the first few coefficients
of the principal components of the joint angle space for motions captured using
an optical motion capture device.

Using walking and running as examples, we have shown that this repre- 
sentation, while having a fairly low dimension, nevertheless has a rich enough 
differential structure to yield good performance at a low computational cost. It 
also has the ability to capture the transition from one motion to another. 

We have demonstrated that our approach can simultaneously handle two different
activities. Our method seems well adapted to 3-D analysis of sport
activities such as a golf swing or a tennis serve. The same can be said of capturing
the motion of orthopedic patients when they are asked to perform a particular
routine designed to evaluate their condition. Applying our method to such activities
will be a subject for future research.

Currently, the major limitation comes from the small size of the database 
we use, which we will endeavor to complete. This should allow us to precisely 
track a wider range of styles, perhaps at the cost of adding some regularization 
constraints that we presently do not need. We also plan to add additional motion 
types, such as jumping, for which motion capture data is fairly easy to acquire. 
In the current database, the samples corresponding to different people tend to 
cluster. If this remains true when the database is completed, this may become a 
promising approach not only for tracking but also for recognition. 
Abstract. Computer vision tasks often require the robust fit of a model to some
data. In a robust fit, two major steps should be taken: i) robustly estimate the 
parameters of a model, and ii) differentiate inliers from outliers. We propose a 
new estimator called Adaptive-Scale Residual Consensus (ASRC). ASRC 
scores a model based on both the residuals of inliers and the corresponding 
scale estimate determined by those inliers. ASRC is very robust to multiple- 
structural data containing a high percentage of outliers. Compared with 
RANSAC, ASRC requires no pre-determined inlier threshold as it can 
simultaneously estimate the parameters of a model and the scale of inliers 
belonging to that model. Experiments show that ASRC has better robustness to 
heavily corrupted data than other robust methods. Our experiments address two 
important computer vision tasks: range image segmentation and fundamental 
matrix calculation. However, the range of potential applications is much broader 
than these. 



1 Introduction 

Unavoidably, computer vision data is contaminated (e.g., by faulty feature extraction,
sensor noise, segmentation errors, etc.) and it is also likely that the data include
multiple structures. Considering any particular structure, outliers to that structure can 
be classified into gross outliers and pseudo outliers [16], the latter being data 
belonging to other structures. Computer vision algorithms should be robust to outliers 
including pseudo outliers [6]. Robust methods have been applied to a wide variety of
tasks such as optical flow calculation [1, 22], range image segmentation [24, 15, 11, 
10, 21], estimating the fundamental matrix [25, 17, 18], etc. 

The breakdown point is the smallest percentage of outliers that can cause the 
estimator to produce arbitrarily large values ([13], p. 9). Least Squares (LS) has a
breakdown point of 0%. To improve on LS, robust estimators have been adopted from 
the statistics literature (such as M-estimators [9], LMedS and LTS [13], etc) but they 
tolerate no more than 50% outliers, limiting their suitability [21]. The computer vision 
community has also developed techniques to cope with outliers: e.g., the Hough 
Transform [8], RANSAC [5], RESC [24], MINPRAN [15], MUSE [11], ALKS [10], 
pbM-estimator [2], MSAC and MLESAC [17]. The Hough Transform determines
consensus for a fit from “votes” in a binned parameter space: however one must 
choose the bin size wisely and, in any case, this technique suffers from high cost 
when the number of parameters is large. Moreover, unlike the other techniques, it 
returns a limited precision result (limited by the bin size). RANSAC requires a
user-supplied error tolerance. RESC attempts to estimate the residual probability density
function but the method needs the user to tune many parameters and we have found
that it overestimates the scale of inliers. MINPRAN assumes that the outliers are
randomly distributed within a certain range, making MINPRAN less effective in
extracting multiple structures. MUSE requires a lookup table for the scale estimator
correction and ALKS is limited in its ability to handle extreme outliers.

T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3023, pp. 107-118, 2004.
© Springer-Verlag Berlin Heidelberg 2004

H. Wang and D. Suter

In this paper, we propose (section 3) a new robust estimator: Adaptive-Scale
Residual Consensus (ASRC), which is based on a robust two-step scale estimator
(TSSE) (section 2). We apply ASRC to range image segmentation and fundamental
matrix calculation (section 4), demonstrating that ASRC outperforms other methods.



2 A Robust Scale Estimator: TSSE 



TSSE [23] is derived from kernel density estimation techniques and the mean 
shift/mean shift valley method. Kernel estimation is a popular method for probability 
density estimation [14]. For $n$ data points $\{X_i\}_{i=1,\ldots,n}$ in a 1-dimensional residual space,
the kernel density estimator with kernel $K$ and bandwidth $h$ is ([14], p.76):

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{x - X_i}{h} \right) . \qquad (1)$$

The Epanechnikov kernel ([14], p.76)

$$K(X) = \frac{3}{4}(1 - X^2) \ \text{ if } (1 - X^2) > 0; \quad 0 \ \text{ otherwise} \qquad (2)$$

is optimum in terms of minimum mean integrated square error (MISE), satisfying
various conditions ([19], p.95). Using such a kernel, the mean shift vector $M_h(x)$ is:

$$M_h(x) \equiv \frac{1}{n_x} \sum_{X_i \in S_h(x)} (X_i - x) = \frac{1}{n_x} \sum_{X_i \in S_h(x)} X_i - x , \qquad (3)$$

where $S_h(x)$ is a hypersphere of radius $h$, having the volume $h^d c_d$ ($c_d$ is the volume
of the unit $d$-dimensional sphere, e.g., $c_1 = 2$), centered at $x$, and containing $n_x$ data points.
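
For the 1-D residual space considered here, the density estimate of equation (1) and the mean shift vector of equation (3) reduce to a few lines. The following Python sketch is illustrative only; the function names, the convergence tolerance, and the iteration cap are assumptions rather than details taken from the paper.

```python
import numpy as np

def epanechnikov(u):
    """1-D Epanechnikov kernel: 0.75 * (1 - u^2) for |u| < 1, else 0."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) < 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def kde(x, data, h):
    """Kernel density estimate at x with bandwidth h."""
    return np.sum(epanechnikov((x - data) / h)) / (len(data) * h)

def mean_shift_vector(x, data, h):
    """Mean of the points inside the window [x - h, x + h], minus x."""
    window = data[np.abs(data - x) <= h]
    if window.size == 0:
        return 0.0
    return float(window.mean() - x)

def find_peak(x, data, h, tol=1e-6, max_iter=500):
    """Follow the mean shift vector until it (nearly) vanishes."""
    for _ in range(max_iter):
        step = mean_shift_vector(x, data, h)
        if abs(step) < tol:
            break
        x += step
    return x
```

Starting `find_peak` at zero on the absolute residuals corresponds to looking for the density mode closest to a perfect fit.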

Marching in the direction of this vector we perform gradient ascent to the peak. 
However, for TSSE we also need to find the valleys. Based upon the Gaussian kernel, 
a saddle-point seeking method was published in [4] but we employ a simpler
method [20], based upon the Epanechnikov kernel and, for our purposes, in 1-D
residual space. The basic idea is to define the mean shift valley vector as:

$$MV_h(x) = -M_h(x) = x - \frac{1}{n_x} \sum_{X_i \in S_h(x)} X_i . \qquad (4)$$

To avoid oscillations, we modify the step size as follows. Let
$\{y_i\}_{i=1,2,\ldots}$ be the sequence of successive locations of the mean shift valley procedure;
then we have, for each $i = 1, 2, \ldots$,

$$y_{i+1} = y_i + \tau \cdot MV_h(y_i) , \qquad (5)$$

where $\tau$ is a correction factor, and $0 < \tau \le 1$. If the shift step at $y_i$ is too large, it causes
$y_{i+1}$ to jump over the local valley and thus oscillate over the valley. This problem can
be avoided by adjusting $\tau$ so that $MV_h(y_i)^T MV_h(y_{i+1}) > 0$.
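
One possible way to realize the step-size correction is sketched below. The text only requires that consecutive valley vectors do not point in opposite directions; halving the correction factor until that holds is an assumption made here for illustration, as are the function name and the stopping tolerances. A zero valley vector (for example, an empty window) is accepted as a stopping point.

```python
import numpy as np

def mean_shift_valley(x, data, h, tol=1e-6, max_iter=500):
    """Move away from density peaks until a local valley is reached.

    Uses MV_h(y) = y - mean(points within h of y). The correction factor
    tau is halved (an assumed strategy) while a step would flip the sign
    of MV_h, which indicates a jump across the valley.
    """
    def mv(y):
        window = data[np.abs(data - y) <= h]
        return 0.0 if window.size == 0 else float(y - window.mean())

    for _ in range(max_iter):
        step = mv(x)
        if abs(step) < tol:
            break
        tau = 1.0
        while tau > 1e-6 and step * mv(x + tau * step) < 0.0:
            tau *= 0.5
        x += tau * step
    return x
```

Because a peak is itself a point where the valley vector vanishes, the search has to be started slightly away from the peak.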

A crucial issue in implementing the TSSE is the kernel bandwidth choice [19, 3]. A
simple over-smoothed bandwidth selector can be employed [19]:

$$\hat{h} = \left[ \frac{243 R(K)}{35 u_2(K)^2 n} \right]^{1/5} S , \qquad (6)$$

where $R(K) = \int K(\zeta)^2 d\zeta$ and $u_2(K) = \int \zeta^2 K(\zeta) d\zeta$. $S$ is the sample standard deviation.

The median [13], MAD [12] or robust $k$ [10] scale estimator can be used to yield
an initial scale estimate. It is recommended that the bandwidth be set as $c\hat{h}$, $(0 < c < 1)$,
to avoid over-smoothing ([19], p.62).
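
For the 1-D Epanechnikov kernel the two kernel constants can be computed in closed form, $R(K) = 3/5$ and $u_2(K) = 1/5$, so the selector becomes a one-line function. The sketch below assumes the usual form of the over-smoothed selector with a fifth root; the function name and the default value of `c` are illustrative.

```python
# Constants of the 1-D Epanechnikov kernel K(u) = 0.75 * (1 - u^2), |u| <= 1.
R_K = 3.0 / 5.0     # integral of K(u)^2 over [-1, 1]
U2_K = 1.0 / 5.0    # integral of u^2 * K(u) over [-1, 1]

def oversmoothed_bandwidth(n, scale, c=1.0):
    """Over-smoothed bandwidth for n residuals with the given scale estimate.

    c is the shrinking factor in (0, 1] applied to limit over-smoothing.
    """
    h = (243.0 * R_K / (35.0 * U2_K ** 2 * n)) ** 0.2 * scale
    return c * h
```

The bandwidth shrinks only as $n^{-1/5}$, so it is the scale estimate passed in, rather than the sample size, that mostly determines the window width.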

We can now describe the TSSE process: 

1. Use mean shift, with initial center zero, to find the local peak, and then use
the mean shift valley method to find the valley next to the peak, all in ascending-ordered
absolute residual space.

2. Estimate the scale of the fit by the median scale estimator [13] on the points 
whose residuals are within the obtained band centered at the local peak. 

Based on TSSE, a new robust estimator (ASRC) will be provided in the next section. 



3 Robust Adaptive-Scale Residual Consensus Estimator 



We assume that when a model is correctly found, two criteria should be satisfied: 

• The (weighted) sum of absolute residuals ($r_i$) of the inliers should be small.

• The scale ($S_{\hat{\theta}}$) (standard deviation) of the inliers should be small.

Given $S_{\hat{\theta}}$, the inliers are those that satisfy:

$$|r_i / S_{\hat{\theta}}| < T \qquad (7)$$

where $T$ is a threshold. If $T$ is 2.5 (1.96), then 98% (95%) of a Gaussian
distribution will be identified as inliers. In our experiments, $T = 2.5$ (except for section
0 where $T = 1.96$). The objective function of the ASRC estimator is:

$$\hat{\theta} = \arg\max_{\hat{\theta}} \left( \sum_{i=1}^{n_{\hat{\theta}}} \left( 1 - \frac{|r_{i,\hat{\theta}}|}{S_{\hat{\theta}} T} \right) \Big/ S_{\hat{\theta}} \right) \qquad (8)$$

where $n_{\hat{\theta}}$ is the number of inliers which satisfy equation (7) for the fitted $\hat{\theta}$.

No a priori knowledge about the scale of inliers is necessary as the proposed method
yields the estimated parameters of a model and the corresponding scale
simultaneously.

The ASRC estimator algorithm is as follows (for fitting models with $p$ parameters):

1. Randomly choose one $p$-subset from the data points, estimate the model
parameters using the $p$-subset, and calculate the ordered absolute residuals.

2. Choose the bandwidth by equation (6). A robust $k$ scale estimator [10] ($k = 0.2$) is
used to yield a coarse initial scale $S_0$.

3. Apply TSSE to the sorted absolute residuals to estimate the scale of inliers $S_1$.
Because the robust $k$ scale estimator is biased for data with multiple structures,
use $S_1$ in equation (6) to apply TSSE again for the final scale of inliers $S_2$.

4. Validate the valley. The probability densities at the local peak $f(\text{peak})$ and local
valley $f(\text{valley})$ are obtained by equation (1). Let $f(\text{valley}) / f(\text{peak}) = \lambda$ (where
$1 > \lambda \ge 0$). Because the inliers are assumed to have a Gaussian-like distribution, the
valley is not sufficiently deep when $\lambda$ is too large (say, larger than 0.8). If the
valley is sufficiently deep, go to step (5); otherwise go to step (1).

5. Calculate the score, i.e., the objective function of the ASRC estimator.

6. Repeat step (1) to step (5) $m$ times. Finally, output the parameters and the scale $S_2$
with the highest score.

Let $\varepsilon$ be the fraction of outliers, $P$ the probability that at least one of the $m$ $p$-tuples
is “clean”; then one can determine $m$ by ([13], p. 198):

$$m = \frac{\log(1 - P)}{\log[1 - (1 - \varepsilon)^p]} \qquad (9)$$
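
Equation (9) is easy to evaluate directly. The helper below is a small sketch with a hypothetical name; it rounds up to the next integer and is only meaningful for outlier fractions strictly between 0 and 1.

```python
import math

def num_samples(outlier_fraction, p, confidence=0.99):
    """Number of random p-subsets needed so that, with the given confidence,
    at least one subset contains no outliers."""
    clean = (1.0 - outlier_fraction) ** p   # probability that one subset is clean
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - clean))
```

The count grows quickly with the outlier fraction, which is why heavily contaminated data such as the examples of Section 4.1 require many more random subsets than mildly contaminated data.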

In [23], we propose a robust Adaptive Scale Sample Consensus (ASSC) estimator: 

6 = arg max(» &n / Sg) (10) 

From equations (8) and (10), we can see that the difference between ASRC and our
recently proposed ASSC [23] is this: in ASSC, all inliers are treated the same, i.e., each
inlier contributes 1 to the objective function of ASSC. However, in ASRC, the sizes of
the residuals of inliers are influential.
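The ASSC objective of equation (10) can be sketched as follows. This is only an illustration under stated assumptions: the inlier test of equation (7) is taken to be |r| / S < T, the scale S is passed in rather than estimated by TSSE, and the name `assc_score` is hypothetical. The ASRC objective of equation (8), which additionally weights inliers by the size of their residuals, is not reproduced here.

```python
import numpy as np


def assc_score(residuals, scale, T=1.96):
    """ASSC-style score: number of inliers divided by the inlier scale.

    Assumption: a point counts as an inlier when |r| / scale < T.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    r = np.abs(np.asarray(residuals, dtype=float))
    n_inliers = int(np.count_nonzero(r / scale < T))
    return n_inliers / scale


# Two hypothetical candidate fits with the same inliers but different scales:
# the fit with the smaller scale receives the higher score.
residuals = [0.1, -0.2, 0.3, 50.0]
print(assc_score(residuals, scale=1.0), assc_score(residuals, scale=0.5))
```

Every inlier adds exactly 1 to the numerator here, which is the behaviour the text contrasts with ASRC.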



4 Experiments 

4.1 Synthetic Examples on Line Fitting and Plane Fitting 

The proposed method is compared with LMedS, RESC, ALKS, and our recently
proposed method, ASSC. We generated four examples: roof, ‘F’-figure, one-step, and
three-step linear signals (the signals are in the magenta color), each with a total of 500
data points, corrupted by Gaussian noise with zero mean and standard variance $\sigma$.
Among the 500 data points, $\alpha$ data points were randomly distributed in the range of
(0, 100). The i-th structure has $n_i$ data points:
(a) Roof: x:(0-50), y=2x, $n_1$=65; x:(50-100), y=200-2x, $n_2$=50; $\alpha$=385; $\sigma$=1.
(b) F-figure: x:(25-75), y=85, $n_1$=40; x:(25-75), y=70, $n_2$=35; x=25, y:(30-85), $n_3$=35; $\alpha$=390; $\sigma$=1.2.
(c) Step: x:(0-50), y=75, $n_1$=45; x:(50-100), y=60, $n_2$=45; $\alpha$=410; $\sigma$=1.
(d) Three-step: x:(0-25), y=20, $n_1$=45; x:(25-50), y=40, $n_2$=30; x:(50-75), y=60, $n_3$=30; x:(75-100), y=80, $n_4$=30; $\alpha$=365; $\sigma$=1.

From Fig. 1 we can see that ASRC correctly fits all four signals. LMedS (50%
breakdown point) failed to fit all four. Although ALKS is sometimes more robust, it
also failed. RESC and ASSC succeeded on the roof signal (87% outliers); however,
they both failed in the other three cases. It should be emphasized that both the
bandwidth choice and the scale estimation in ASRC are data-driven: an improvement
over RANSAC, where the user sets an a priori scale-related error bound.

Fig. 1. Comparing the performance of five methods: (a) fitting a roof with a total of 87%
outliers; (b) fitting F-figure with a total of 92% outliers; (c) fitting a step with a total of 91%
outliers; (d) fitting three-step with a total of 91% outliers.

Fig. 2. (a) the 3D data with 80% outliers; the extracted results by (b) ASRC; (c) ASSC; (d)
RESC; (e) ALKS; and (f) LMedS.

Next, two 3D signals were used: 500 data points and three planar structures, with
each plane containing n points corrupted by Gaussian noise with standard variance $\sigma$
(= 3.0); the remaining 500 - 3n points are randomly distributed. In the first example, n = 100; in the
second, n = 65. We repeat two steps until all planes are extracted: (1) estimate the
parameters and scale of a plane; (2) extract the inliers and remove them from the data set. The red
circles denote the first plane extracted, green stars the second, and blue squares the
third (Fig. 2 and Fig. 3).

From Fig. 2 (d) and (e), we can see that RESC and ALKS, which claim to be
robust to data with more than 50% outliers, failed to extract all three planes. This
is because the estimated scales (by RESC and ALKS) for the first plane were wrong,
which caused these two methods to fail to fit the second and third planes. Because
LMedS (Fig. 2 (f)) has only a 50% breakdown point, it completely failed to fit data
with such high contamination (80% outliers). The proposed method and ASSC
yielded the best results (Fig. 2 (b) and (c)). Similarly, in the second 3D experiment
(Fig. 3), RESC, ALKS and LMedS completely broke down. ASSC, although it
correctly fitted the first plane, wrongly fitted the second and the third planes. Only the
proposed method correctly fitted and extracted all three planes (Fig. 3 (b)).



Fig. 3. (a) the 3D data with 87% outliers; the extracted results by (b) ASRC; (c) ASSC; (d)
RESC; (e) ALKS; and (f) LMedS.



4.2 Range Image Segmentation 

Many robust estimators have been employed to segment range images ([24, 11, 10,
21], etc.). Here, we use the ABW range images (obtained from
http://marathon.csee.usf.edu/seg-comp/SegComp.html). The images have 512x512 pixels and contain
planar structures. We employ a hierarchical approach with four levels [21]. The bottom
level of the hierarchy contains 64x64 pixels that are obtained by using regular
sampling on the original image. The top level of the hierarchy is the original image.
We begin with the bottom level. In each level of the hierarchy, we:

(1) Apply the ASRC estimator to obtain the parameters of a plane and the scale of
inliers. If the number of inliers is less than a threshold, go to step (6).

(2) Use the normals to the planes to validate the inliers obtained in step (1). When
the angle between the normal of a data point that has been classified as an
inlier and the normal of the estimated plane is less than a threshold value, the
data point is accepted. Otherwise, the data point is rejected and will be left for
further processing. If the number of validated inliers is small, go to step (6).

(3) Fill in the holes, which may appear due to sensor noise, inside the maximum
connected component (CC) from the validated inliers.

(4) In the top hierarchy, assign a label to the points corresponding to the CC from
step (3) and remove these points from the data set.

(5) If a point is unlabelled and it is not a jump edge point, the point is a "left-over"
point. After collecting all these, use the CC algorithm to get the maximum CC.
If the number of data points in the maximum CC of "left-over" points is smaller
than a threshold, go to step (6); otherwise, sample the maximum CC obtained in
this step, then go to step (1).

(6) Terminate the processing in the current level of the hierarchy and go to the
higher-level hierarchy until the top of the hierarchy is reached.
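The normal-based validation of step (2) can be sketched as below. The function name, the 20-degree threshold, and the choice to ignore the sign of the normals are assumptions made for illustration; the paper does not state the threshold value or how per-point normals are computed.

```python
import numpy as np


def validate_by_normals(point_normals, plane_normal, max_angle_deg=20.0):
    """Boolean mask of points whose normal is within max_angle_deg of the plane normal.

    point_normals : (n, 3) array of per-point surface normals
    plane_normal  : length-3 normal of the estimated plane
    """
    n = np.asarray(point_normals, dtype=float)
    p = np.asarray(plane_normal, dtype=float)
    n = n / np.linalg.norm(n, axis=1, keepdims=True)
    p = p / np.linalg.norm(p)
    # abs() makes the test independent of the orientation (sign) of the normals;
    # this is an assumption of the sketch.
    cos_angle = np.clip(np.abs(n @ p), 0.0, 1.0)
    return np.degrees(np.arccos(cos_angle)) < max_angle_deg


# Hypothetical example: three normals close to the plane normal, one perpendicular.
tilt = np.radians(10.0)
normals = np.array([[0.0, 0.0, 1.0],
                    [0.0, 0.0, -1.0],
                    [1.0, 0.0, 0.0],
                    [0.0, np.sin(tilt), np.cos(tilt)]])
mask = validate_by_normals(normals, plane_normal=[0.0, 0.0, 2.0])
print(mask)
```

Points rejected by the mask stay in the data set and are left for further processing, as described in step (2).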




Fig. 4. Segmentation of ABW range images from the USF database. (a1, b1) Range image with
26214 random noise points; (a2, b2) The ground truth results for the corresponding range
images without adding random noise; (a3, b3) Segmentation result by ASRC.



The proposed range image segmentation method is very robust to noise. We added 
26214 random noise points to the range images (in Fig. 4) taken from the USF ABW 
range image database (“test 11” and “test 3”). No separate noise filtering is 
performed. All of the main surfaces were recovered by our method. 






Fig. 5. Comparison of the segmentation results for ABW range image (train 7). (a) Range
image; (b) The result of ground truth; (c) The result by the ASRC; (d) The result by the UB; (e)
The result by the WSU; (f) The result by the USF.

We also compared our results with those of three state-of-the-art approaches:
USF, WSU, and UB [7]. Fig. 5 (c-f), showing the results obtained by the four
methods, should be compared with the ground truth (Fig. 5 (b)).

From Fig. 5, we can see that the proposed method achieved the best results: all
surfaces are recovered and the segmented surfaces are relatively “clean”. In
comparison, some boundaries on the junction of the segmented patch by the UB were
seriously distorted. The WSU and USF results contained many noisy points, and WSU
over-segmented one surface. The proposed method takes about 1-2 minutes (on an
AMD 800 MHz personal computer, implemented in C interfaced with MATLAB).



4.3 Fundamental Matrix Estimation 

Several robust estimators, such as M-estimators, LMedS, RANSAC, MSAC and
MLESAC, have been applied in estimating the fundamental matrix [17]. However,
M-estimators and LMedS have a low breakdown point, while RANSAC and MSAC
need a priori knowledge about the scale of inliers. MLESAC performs similarly to
MSAC.

The proposed ASRC can tolerate more than 50% outliers, and no a priori scale
information about inliers is required.

Let $\{x_i\}$ and $\{x'_i\}$ (for i = 1, ..., n) be a set of homogeneous image points viewed
in image 1 and image 2. We have the following constraints for the fundamental
matrix F:

$$ x_i'^{T} F x_i = 0 \quad \text{and} \quad \det[F] = 0 \qquad (11) $$



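We employ the 7-point algorithm [17] to solve for candidate fits and use the Sampson
distance as the residual. For the i-th correspondence, the residual $r_i$ using the Sampson distance is:

$$ r_i = \frac{k_i}{\left( k_x^2 + k_y^2 + k_{x'}^2 + k_{y'}^2 \right)^{1/2}} \qquad (12) $$

where $k_i = f_1 x'_i x_i + f_2 x'_i y_i + f_3 x'_i \zeta + f_4 y'_i x_i + f_5 y'_i y_i + f_6 y'_i \zeta + f_7 x_i \zeta + f_8 y_i \zeta + f_9 \zeta^2$.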
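The Sampson residual of equation (12) can be computed in a few lines. In the sketch below the function name is hypothetical, the homogeneous scale is assumed to be 1, and the four terms in the denominator are taken to be the partial derivatives of the algebraic error with respect to the image coordinates x, y, x', y' (the standard first-order form of the Sampson distance).

```python
import numpy as np


def sampson_residuals(F, x1, x2):
    """Signed Sampson residuals for correspondences x1 <-> x2.

    F  : 3x3 fundamental matrix, with x2^T F x1 = 0 for perfect matches
    x1 : (n, 2) pixel coordinates in image 1
    x2 : (n, 2) pixel coordinates in image 2
    """
    F = np.asarray(F, dtype=float)
    h1 = np.column_stack([x1, np.ones(len(x1))])  # homogeneous points, scale 1
    h2 = np.column_stack([x2, np.ones(len(x2))])
    Fx1 = h1 @ F.T    # row i is F @ x1_i
    Ftx2 = h2 @ F     # row i is F^T @ x2_i
    k = np.sum(h2 * Fx1, axis=1)  # algebraic error x2^T F x1
    grad_sq = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return k / np.sqrt(grad_sq)


# Hypothetical example: pure horizontal translation, so matches share the same row.
F_example = np.array([[0.0, 0.0, 0.0],
                      [0.0, 0.0, -1.0],
                      [0.0, 1.0, 0.0]])
pts1 = np.array([[10.0, 20.0], [30.0, 40.0]])
pts2 = np.array([[15.0, 20.0], [33.0, 42.0]])  # second match is off by two rows
r = sampson_residuals(F_example, pts1, pts2)
print(r)
```

The first match satisfies the epipolar constraint exactly and has zero residual; the second is displaced vertically and has a non-zero one.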




Fig. 6. An experimental comparison of estimating the fundamental matrix for data with 60%
outliers. (a) The distributions of inliers and outliers; (b) the distribution of true inliers; the
inliers obtained by (c) ASRC; (d) MSAC; (e) RANSAC; and (f) LMedS.

Table 1. An experimental comparison for data with 60% outliers.

               % of inliers           % of outliers          Standard variance
               correctly classified   correctly classified   of inliers
Ground Truth   100.00                 100.00                 0.9025
ASRC           95.83                  100.00                 0.8733
MSAC           100.00                 65.56                  41.5841
RANSAC         100.00                 0.56                   206.4936
LMedS          100.00                 60.00                  81.1679



We generated 300 matches including 120 point pairs of inliers with unit Gaussian
variance (matches in blue color in Fig. 6(a)) and 180 point pairs of random outliers
(matches in cyan color in Fig. 6(a)). Thus the outliers occupy 60% of the whole data.
The scale information about inliers is usually not available; thus, the median scale
estimator, as recommended in [17], is used for RANSAC and MSAC to yield an
initial scale estimate. The number of random samples is set to 10000. From Fig. 6 and
Table 1, we can see that our method yields the best result.




Fig. 7. A comparison of correctly identified percentage of inliers (a), outliers (b), and the
comparison of standard variance of residuals of inliers (c).



Table 2. Experimental results on two frames of the Corridor sequence.

          Number of inliers   Mean error of inliers   Standard variance of inliers
ASRC      269                 -0.0233                 0.3676
MSAC      567                 -0.9132                 7.5134
RANSAC    571                 -1.2034                 8.0816
LMedS     571                 -1.1226                 8.3915



Next, we investigate the behavior for data involving different percentages of
outliers (PO). We generated the data (in total 300 correspondences) similar to that in
Fig. 6. The percentage of outliers varies from 5% to 70% in increments of 5%. The
experiments were repeated 100 times for each percentage of outliers. If a method is
robust enough, it should resist the influence of outliers: the correctly identified
percentage of inliers should be around 95% (T is set to 1.96 in equation (7)) and the
standard variance of inliers should be near 1.0 regardless of the percentage of
outliers.

We set the number of random samples, m, to be: m = 1000 when PO < 40; 10000
when 40 ≤ PO < 60; and 30000 when PO ≥ 60, to ensure a high probability of success.
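To see what these sample counts imply, the sketch below evaluates the probability that at least one of the m random subsets is free of outliers. It assumes p = 7 points per subset (the 7-point algorithm) and treats the boundary values PO = 40 and PO = 60 as belonging to the larger sample count; both function names are hypothetical.

```python
def samples_for_outlier_percentage(po: float) -> int:
    """Number of random samples m used for a given percentage of outliers (PO)."""
    if po < 40:
        return 1000
    if po < 60:
        return 10000
    return 30000


def success_probability(m: int, po: float, p: int = 7) -> float:
    """Probability that at least one of m random p-subsets contains no outlier."""
    clean = (1.0 - po / 100.0) ** p  # probability that one subset is clean
    return 1.0 - (1.0 - clean) ** m


for po in (35, 55, 70):
    m = samples_for_outlier_percentage(po)
    print(po, m, success_probability(m, po))
```

Even at 70% outliers, where a single 7-point subset is clean only about twice in ten thousand draws, 30000 samples keep the success probability above 0.99.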

From Fig. 7, we can see that MSAC, RANSAC, and LMedS all break down when
the data involve more than 50% outliers. The standard variance of inliers by ASRC is the
smallest when PO > 50%. Note that ASRC succeeds in finding the inliers and outliers even
when outliers occupy 70% of the whole data.

Next, two frames of the Corridor sequence (bt.000 and bt.004) were obtained from
http://www.robots.ox.ac.uk/~vgg/data/ (Fig. 8(a) and (b)). Fig. 8(c) shows the
matches involving 800 point pairs in total. The inliers (269 correspondences) obtained 
by the proposed method are shown in Fig. 8(d). The epipolar lines and epipole using 
the estimated fundamental matrix by ASRC are shown in Fig. 8(e) and (f). In Fig. 8(e) 
and (f), we draw 30 epipolar lines. We can see that most of the point pairs correspond 
to the same feature in the two images except for one case: the 30th point pair, which 
is pointed out by the two arrows. The reason is that the residual of the point pair 
corresponding to the estimated fundamental matrix is small: the epipolar constraint is 







Robust Fitting by Adaptive- Scale Residual Consensus 



117 



a weak constraint, and any method enforcing only the epipolar constraint scores
this match highly. Because the camera matrices of the two frames are available, we
can obtain the ground truth fundamental matrix and thus evaluate the errors (Table 2).
We can see that ASRC performs the best.




Fig. 8. (a)(b) image pair; (c) matches; (d) inliers by ASRC; (e)(f) epipolar geometry.



5 Conclusion 

The proposed ASRC method exploits both the residuals of inliers and the
corresponding scale estimate using those inliers in determining the merit of a model fit.
This estimator is very robust to multiple-structural data and can tolerate more than
80% outliers. The ASRC estimator is compared to several popular and
recently proposed robust estimators: LMedS, RANSAC, MSAC, RESC, ALKS, and
ASSC, showing that the ASRC estimator achieves better results. (Readers may
download the paper from http://www-personal.monash.edu.au/~hanzi , containing the
corresponding color figures and images, to understand the results better.) Recently, a
“pbM-estimator” [2], also using kernel density estimation, was announced. However,
this employs projection pursuit and orthogonal regression. In contrast, we consider the
density distribution of the mode in the residual space.



Abstract. The problem of Simultaneous Localization And Mapping (SLAM)
originally arose from the robotics community and is closely related to the 
problems of camera motion estimation and structure recovery in computer 
vision. Recent work in the vision community addressed the SLAM problem 
using either active stereo or a single passive camera. The precision of camera 
based SLAM was tested in indoor static environments. However the extended 
Kalman filters (EKF) as used in these tests are highly sensitive to outliers. For 
example, even a single mismatch of some feature point could lead to 
catastrophic collapse in both motion and structure estimates. In this paper we 
employ a robust-statistics-based condensation approach to the camera motion 
estimation problem. The condensation framework maintains multiple motion 
hypotheses when ambiguities exist. Employing robust distance functions in the 
condensation measurement stage enables the algorithm to discard a 
considerable fraction of outliers in the data. The experimental results 
demonstrate the accuracy and robustness of the proposed method. 



1 Introduction 

While the vision community struggled with the difficult problem of estimating motion 
and structure from a single camera generally moving in 3D space (see [5]), the 
robotics community independently addressed a similar estimation problem known as 
Simultaneous Localization and Mapping (SLAM) using odometry, laser range
finders, sonars and other types of sensors together with further assumptions such as 
planar robot motion. Recently, the vision community has adopted the SLAM name 
and some of the methodologies and strategies from the robotics community. Vision 
based SLAM has been proposed in conjunction with an active stereo head and 
odometry sensing in [7], where the stereo head actively searched for old and new 
features with the aim of improving the SLAM accuracy. In [6] the more difficult issue 
of localization and mapping based on data from a single passive camera is treated. 
The camera is assumed to be calibrated, and some features with known 3D locations
are assumed to be present. These features impose a metric scale on the scene, enable the
proper use of a motion model, increase the estimation accuracy and avoid drift. These
works on vision based SLAM employ an Extended Kalman Filter (EKF) approach 
where camera motion parameters are packed together with 3D feature locations to 
form a large and tightly coupled estimation problem. The main disadvantage of this 
approach is that even a single outlier in measurement data can lead to a collapse of the 
whole estimation problem. Although there are means for excluding problematic 

T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3023, pp. 119-131, 2004.

feature points in tracking algorithms, it is impossible to completely avoid outliers in
uncontrolled environments. These outliers may result from mismatches of some 
feature points which are highly likely to occur in cluttered environments, at depth 
discontinuities or when repetitive textures are present in the scene. Outliers may exist 
even if the matching algorithm performs perfectly when some objects in the scene are 
moving. In this case multiple-hypothesis estimation as naturally provided by particle 
filters is appropriate. The estimation of the stationary scene structure together with the 
camera ego-motion is the desired output under the assumption that most of the 
camera’s field of view looks at a static scene. The use of particle filters in SLAM is 
not new. Algorithms for FastSLAM [19] employed a particle filter for the motion 
estimation, but their motivation was mainly computational speed and robust 
estimation methodology was neither incorporated nor tested. In [18] a version of 
FastSLAM addressing the problem of data association between landmarks and 
measurements is presented. However, the solution to the data association problem 
provided there does not offer a solution to the problem of outliers since all landmarks 
are assumed stationary and every measurement is assumed to correctly belong to one 
of the real physical landmarks. Other works, e.g. [6], employed condensation only
in initialization of distances of new feature points before their insertion into the EKF. 
However the robustness issue is not solved in this approach since the motion and 
mapping are still provided by the EKF. In [23] the pose of the robot was estimated by 
a condensation approach. However, here too the algorithm lacked robust statistics 
measures to effectively reject outliers in the data. Furthermore the measurements in 
this work were assumed to be provided by laser range finders and odometric sensors. 
In this work we propose a new and robust solution to the basic problem of camera 
motion estimation from known 3D feature locations, which has practical importance 
of its own. The full SLAM problem is then addressed in the context of supplementing 
this basic robust camera motion estimation approach for simultaneously providing 
additional 3D scene information. The paper is organized as follows: Section 2 
formulates the basic motion estimation problem. Section 3 presents the proposed 
framework for robust motion from structure. Section 4 discusses methods for 
incorporating the proposed framework for the solution of SLAM. Section 5 presents 
results on both synthetic data and real sequences and compares the performance to 
that of EKF based methods. 



2 Problem Formulation 



Throughout this work it is assumed that the camera is calibrated. This assumption is 
commonly made in previous works on vision based SLAM. A 3D point indexed by i 

in the camera axes coordinates, $(X_i(t)\; Y_i(t)\; Z_i(t))^T$, projects to the image point
$(x_i(t)\; y_i(t))^T$ at frame time t via some general projection function $\Pi$ as follows:

$$ \begin{pmatrix} x_i(t) \\ y_i(t) \end{pmatrix} = \Pi \begin{pmatrix} X_i(t) \\ Y_i(t) \\ Z_i(t) \end{pmatrix} \qquad (1) $$

The camera motion between two consecutive frames is represented by a rotation
matrix R(t) and a translation vector V(t). Hence for a static point in the scene:

$$ \begin{pmatrix} X_i(t) \\ Y_i(t) \\ Z_i(t) \end{pmatrix} = R(t) \begin{pmatrix} X_i(t-1) \\ Y_i(t-1) \\ Z_i(t-1) \end{pmatrix} + V(t) \qquad (2) $$

The rotation is represented using the exponential canonical form $R(t) = e^{\hat{\omega}(t)}$, where
$\omega(t)$ represents the angular velocity between frames t-1 and t, and the exponent
denotes the matrix exponential. The hat notation for some 3D vector q is defined by:

$$ \hat{q} = \begin{pmatrix} 0 & -q_3 & q_2 \\ q_3 & 0 & -q_1 \\ -q_2 & q_1 & 0 \end{pmatrix} $$

The matrix exponential of such skew-symmetric matrices may be computed using
Rodrigues' formula:

$$ e^{\hat{q}} = I + \frac{\hat{q}}{\|q\|} \sin(\|q\|) + \frac{\hat{q}^2}{\|q\|^2} \left( 1 - \cos(\|q\|) \right) $$
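A minimal numerical sketch of the hat operator and Rodrigues' formula follows. The function names `hat` and `exp_so3` and the small-angle cutoff `eps` are illustrative choices, not part of the paper.

```python
import numpy as np


def hat(q):
    """Skew-symmetric matrix of a 3D vector q, so that hat(q) @ v == np.cross(q, v)."""
    q1, q2, q3 = q
    return np.array([[0.0, -q3, q2],
                     [q3, 0.0, -q1],
                     [-q2, q1, 0.0]])


def exp_so3(q, eps=1e-12):
    """Rotation matrix exp(hat(q)) computed with Rodrigues' formula."""
    q = np.asarray(q, dtype=float)
    angle = np.linalg.norm(q)
    if angle < eps:
        # First-order approximation near zero avoids dividing by a tiny norm.
        return np.eye(3) + hat(q)
    K = hat(q) / angle
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)


# Example: a quarter turn about the Z axis maps the X axis onto the Y axis.
R = exp_so3([0.0, 0.0, np.pi / 2])
print(R @ np.array([1.0, 0.0, 0.0]))
```

The result is an orthonormal matrix with unit determinant for any input vector, which is what makes this parameterization convenient for representing incremental rotations.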



Let us denote by $\Omega(t)$ and T(t) the overall rotation and translation from some fixed
world coordinate system to the camera axes:

$$ \begin{pmatrix} X_i(t) \\ Y_i(t) \\ Z_i(t) \end{pmatrix} = e^{\hat{\Omega}(t)} \begin{pmatrix} X_i^{World} \\ Y_i^{World} \\ Z_i^{World} \end{pmatrix} + T(t) \qquad (3) $$

Equation (3) describes the pose of the world relative to the camera. The camera pose
relative to the world is given by: $\Omega_{Camera} = -\Omega(t)$ ; $T_{Camera} = -e^{-\hat{\Omega}(t)} T(t)$.

Using equations (2), (3) and (3) written one sample backward:

$$ \Omega(t) = \log_{SO(3)}\left( e^{\hat{\omega}(t)} e^{\hat{\Omega}(t-1)} \right), \qquad T(t) = e^{\hat{\omega}(t)} T(t-1) + V(t) \qquad (4) $$

where $q = \log_{SO(3)}(A)$ denotes the inverse of the matrix exponential, i.e. the vector q
whose skew-symmetric matrix satisfies $A = e^{\hat{q}}$ (inverting Rodrigues' formula). Let us
define the robust motion from structure estimation problem: given matches of 2D
image feature points to known 3D locations, estimate the camera motion in a robust
framework accounting for the possible presence of outliers in measurement data.



2.1 Dynamical Motion Model

One can address the camera motion estimation problem with no assumptions on the 
dynamical behavior of the camera (motion model), thus using only the available 
geometric information in order to constrain the camera motion. This is equivalent to 
assuming independent and arbitrary viewpoints at every frame. In most practical 
applications though, physical constraints result in high correlation of pose between 
adjacent frames. For example, a camera mounted on a robot traveling in a room 
produces smooth motion trajectories unless the robot hits some obstacle or collapses. 
The use of a proper motion model accounts for uncertainties, improves the estimation 
accuracy, attenuates the influence of measurement noise and helps overcome 
ambiguities (which may occur if at some time instances, the measurements are not 
sufficient to uniquely constrain camera pose, see [5] and [6]). Throughout this work, 
the motion model assumes constant velocity with acceleration disturbances, as 
follows: 



$$ \omega(t) = \omega(t-1) + \dot{\omega}(t), \qquad V(t) = V(t-1) + \dot{V}(t) \qquad (5) $$

If no forces act on the camera, the angular and translation velocities are constant.
Accelerations result from forces and moments which are applied on the camera, and
these, being unknown, are treated as disturbances (recall that the vectors $\omega(t)$, V(t) are
velocity terms and the time is the image frame index).

Acceleration disturbances are modeled here probabilistically by independent white
Gaussian noises:

$$ \dot{\omega}(t) \sim N(0, \sigma_{\dot{\omega}}), \qquad \dot{V}(t) \sim N(0, \sigma_{\dot{V}}) \qquad (6) $$

where $\sigma_{\dot{\omega}}$ and $\sigma_{\dot{V}}$ denote expected standard deviations of the angular and linear
acceleration disturbances.
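The constant-velocity model with acceleration disturbances, equations (5) and (6), amounts to a random walk on the two velocity vectors. The sketch below assumes isotropic noise with the same standard deviation on each of the three components; the function name and the example values are illustrative.

```python
import numpy as np


def propagate_velocities(omega_prev, V_prev, sigma_omega_dot, sigma_V_dot, rng):
    """One step of equations (5)-(6): add white Gaussian acceleration noise."""
    omega_dot = rng.normal(0.0, sigma_omega_dot, size=3)  # angular disturbance
    V_dot = rng.normal(0.0, sigma_V_dot, size=3)          # linear disturbance
    return np.asarray(omega_prev) + omega_dot, np.asarray(V_prev) + V_dot


# Hypothetical example: start from a camera translating along Z and take five steps.
rng = np.random.default_rng(0)
omega, V = np.zeros(3), np.array([0.0, 0.0, 1.0])
for _ in range(5):
    omega, V = propagate_velocities(omega, V, 0.003, 0.0005, rng)
print(omega, V)
```

With both standard deviations set to zero the velocities stay constant, which corresponds to the force-free case described above.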



3 Robust Motion from Structure by Condensation 

In this section we present the proposed condensation based algorithm designed for 
robust camera 3D motion estimation. A detailed description of condensation in 
general and its application to contour tracking can be found in [12] and [13]. The state 
vector of the estimator at time t, denoted by $s_t$, includes all the motion parameters:

$$ s_t = (\Omega(t)\; T(t)\; \omega(t)\; V(t))^T $$

The state vector is of length 12. The state dynamics are generally specified in the
condensation framework by the probability distribution function $p(s_t | s_{t-1})$. Our
motion model is described by equations (4), (5), (6). All measurements at time t are
denoted compactly as z(t). The camera pose is defined for each state $s_t$ separately, with
the corresponding expected projections being tested on all the visible points in the
current frame:

The influence of the measurements is quantified by $p(z(t) | s_t)$. This is the
conditional Probability Distribution Function (PDF) of measuring the identified
features z(t) when the true parameters of motion correspond to the state $s_t$. The
conditional PDF is calculated as a function of the geometric error, which is the
distance, denoted by $d_i$, between the projected 3D feature point location on the image
plane and the measured image point. If the image measurement errors are statistically
independent random variables with zero mean Gaussian PDF, then up to a
normalizing constant:

$$ p(z | s) = \exp\left\{ -\frac{\sum_{i=1}^{N_{points}} d_i^2}{2\sigma^2 N_{points}} \right\} $$

where $N_{points}$ is the number of visible feature points and $\sigma$ is the standard deviation
of the measurement error (about 1 pixel). Since outliers have large $d_i$ values even for
the correct motion, the quadratic distance function may be replaced by a robust
distance function $\rho(d_i^2)$, see e.g. [20]:

$$ p(z | s) = \exp\left\{ -\frac{\sum_{i=1}^{N_{points}} \rho(d_i^2)}{2\sigma^2 N_{points}} \right\} \qquad (7) $$

(8)



If some feature point is behind the camera (this occurs when its 3D coordinates
expressed in camera axes have a negative Z value), clearly this feature should not
have been visible and hence its contribution to the sum is set to the value:

$$ \lim_{d \to \infty} \rho(d^2) = L^2 $$

The influence of every feature point on the PDF is now limited by the parameter L.
The choice of L reflects a threshold value between inliers and outliers. In order to
understand why robustness is achieved using such distance functions, let us consider
the simpler robust distance function, the truncated quadratic:

$$ \rho(d^2) = \begin{cases} d^2 & d^2 < A^2 \\ A^2 & \text{otherwise} \end{cases} $$

where A is the threshold value between inliers and outliers. Using this $\rho$ function in
equation (7) yields:

$$ p(z | s) = \exp\left\{ -\frac{\sum_{d_i^2 < A^2} d_i^2 + A^2 \cdot \#\{ i : d_i^2 \ge A^2 \}}{2\sigma^2 N_{points}} \right\} $$

Maximizing this PDF (a maximum likelihood estimate) is equivalent to minimizing
the sum of two terms: the first is the sum of the quadratic distances at the inlier
points and the second is proportional to the number of outliers. The robust
distance function of equation (8) is similar to the truncated quadratic, with a smoother
transition between the inliers and outliers (see [2] and [3] for an analysis of $\rho$
functions used in robust statistics and their use for image reconstruction and for the
calculation of piecewise-smooth optical flow fields). Let us summarize the proposed
algorithm for robust 3D motion estimation from known structure:

Initialization: Sample N states $s_0^{(n)}$, n = 1...N, from the prior PDF of $\omega(0)$, V(0) and
$\Omega(0)$, T(0). Initialize $\pi_0^{(n)}$ with the PDF corresponding to each state.

At every time step t = 1, 2, ... :

• Sample N states $\tilde{s}_{t-1}^{(n)}$ copied from the states $s_{t-1}^{(n)}$ with probabilities $\pi_{t-1}^{(n)}$.

• Propagate the sampled states using equations (6), (5), (4) to obtain $s_t^{(n)}$.

• Incorporate the measurements to obtain $\pi_t^{(n)} = p(z_t | s_t^{(n)})$ using equations
(7), (8). Then normalize by the appropriate factor so that $\sum_{n=1}^{N} \pi_t^{(n)} = 1$.

• Extract the dominant camera motion from the state $s_t^{(n)}$ corresponding to the
maximum of $\pi_t^{(n)}$: $\Omega_{Camera} = -\Omega^{(n)}(t)$ ; $T_{Camera} = -e^{-\hat{\Omega}^{(n)}(t)} T^{(n)}(t)$

Code written in C++ implementing the algorithm of this section can be found in [25]. 
It can run in real time on a Pentium 4, 2.5GHz processor, with 30Hz sampling rate, 
1000 particles and up to 200 instantaneously visible feature points. 
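The loop of this section (resample, propagate, weight, normalize, pick the best particle) can be sketched generically as follows. This is not the C++ implementation of [25]: the truncated quadratic is used in place of the smoother function of equation (8), the state is a toy scalar rather than the 12-parameter motion state, and all function names, the threshold A, and the noise levels are illustrative assumptions.

```python
import numpy as np


def truncated_quadratic(d2, A):
    """rho(d^2) = d^2 if d^2 < A^2, and A^2 otherwise."""
    return np.minimum(np.asarray(d2, dtype=float), A * A)


def robust_likelihood(d2, sigma, A):
    """Unnormalized p(z | s) of equation (7), with the truncated quadratic as rho."""
    d2 = np.asarray(d2, dtype=float)
    return float(np.exp(-np.sum(truncated_quadratic(d2, A)) / (2.0 * sigma ** 2 * d2.size)))


def condensation_step(states, weights, propagate, squared_errors, sigma, A, rng):
    """Resample, propagate and re-weight N particles; also return the best particle."""
    n = len(states)
    idx = rng.choice(n, size=n, p=weights)                 # sample with probabilities pi
    new_states = [propagate(states[i], rng) for i in idx]  # apply the motion model
    w = np.array([robust_likelihood(squared_errors(s), sigma, A) for s in new_states])
    w = w / w.sum()                                        # weights sum to one
    return new_states, w, new_states[int(np.argmax(w))]


# Toy usage: estimate a scalar offset from measurements containing one gross outlier.
rng = np.random.default_rng(0)
z = np.array([1.0, 1.1, 0.9, 1.05, 25.0])  # the last measurement is an outlier
states = list(rng.uniform(-5.0, 5.0, size=500))
weights = np.full(500, 1.0 / 500)
for _ in range(10):
    states, weights, best = condensation_step(
        states, weights,
        propagate=lambda s, r: s + r.normal(0.0, 0.1),
        squared_errors=lambda s: (z - s) ** 2,
        sigma=1.0, A=2.0, rng=rng)
print(best)
```

Because the outlier's contribution is capped at A squared for every hypothesis, it shifts all weights by the same factor and does not pull the best particle away from the inliers.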



4 Application to SLAM 

This section describes various possible solutions to the robust SLAM problem. 



4.1 SLAM in a Full Condensation Framework 

The most comprehensive solution to robust SLAM is to pack all the estimated
parameters into one large state and solve using a robust condensation framework. The
state is composed of the motion parameters, and each feature contributes three
additional parameters for its 3D location. As stated in [7], this solution is very
expensive computationally due to the large number of particles required to properly
sample from the resulting high dimensional space.



4.2 SLAM in a Partially Decoupled Scheme

The research on vision based SLAM tends to incorporate features with known 3D 
locations in the scene. The simplest method for incorporating the proposed robust 
motion from structure algorithm into SLAM is in a partially decoupled block scheme 
in which the features with known 3D locations are the input to the robust motion 
estimation block of section 3. Structure of other features in the scene can be recovered 
using the estimated motion and the image measurements. Assuming known 3D 
motion, the structure of each feature can be estimated using an EKF independently for 
each feature (similar to FastSLAM in [19]). If enough features with known structure are
available in the camera field of view at all times (few can be enough as shown in the 
experiments section), then this method can work properly. It may be practical for 
robots moving in rooms and buildings to locate known and uniquely identifiable 
features (fiducials) at known locations. When the motion estimation is robust, the 
independence of the estimators for the structure of the different features guarantees 
the robustness of the structure recovery as well. 





4.3 SLAM with Robust Motion Estimation and Triangulation 

In this section we propose a solution to the robust SLAM problem in a condensation 
framework with a state containing motion parameters only. In the measurement phase, 
features with known locations have their 3D structure projected on the image plane, 
features with unknown structure have their 3D structure reconstructed using 
triangulation (see [9] chapter 11) and the geometric error is measured by projecting 
this structure back on the image plane. The information regarding the camera pose in 
the current and previous frames is embedded in each state hypothesis of the 
condensation algorithm which together with the corresponding image measurements 
form the required information for the triangulation process. Triangulation can be 
performed from three views, where the third view is the first appearance of the 
feature. 
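As an illustration of the triangulate-then-reproject measurement described above, the sketch below uses a standard linear (DLT) triangulation from two views. The paper refers to [9] for the triangulation method and mentions three views, so this two-view DLT variant, the function names, and the normalized example cameras are all assumptions.

```python
import numpy as np


def triangulate_dlt(P1, P2, x1, x2):
    """Linear triangulation of one 3D point from two 3x4 projection matrices."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]              # right singular vector of the smallest singular value
    return X[:3] / X[3]     # back from homogeneous coordinates


def reprojection_error(P, X, x):
    """Distance between the projection of X under P and the measured point x."""
    proj = P @ np.append(X, 1.0)
    return float(np.linalg.norm(proj[:2] / proj[2] - np.asarray(x, dtype=float)))


# Hypothetical example: two normalized cameras separated by a unit baseline along X.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
x1 = (0.125, 0.05)    # projection of (0.5, 0.2, 4) in the first view
x2 = (-0.125, 0.05)   # projection of the same point in the second view
X_est = triangulate_dlt(P1, P2, x1, x2)
print(X_est, reprojection_error(P1, X_est, x1))
```

In the condensation setting, each state hypothesis supplies its own pair of camera poses, so a wrong hypothesis yields a reconstructed point with a larger reprojection error and hence a lower weight.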



5 Experimental Results 

5.1 Synthetic Tests 

It has been experimentally found using synthetic tests that robustness with the
proposed method is maintained with up to about 33% of outliers. The proposed
algorithm is compared with the results of the EKF approach in [5], which is run with
the code supplied in [15]. The robust motion estimation used triangulation in three
frames as described in section 4.3. The 3D structure was unknown to both algorithms.
The outlier points are randomly chosen and remain fixed throughout the sequence;
these points are given random image coordinates uniformly distributed in the image
range (see examples in [25]). The rotation errors are compactly characterized by:

T ^ 2 
J { g ^ True ^ Estimated 



\Frobenius 



The estimation results are shown in Fig. 1. With 33% of outliers, the EKF errors are 
unacceptable while the proposed method maintains reasonable accuracy. 





Fig. 1 . The translation (left) and rotation (middle) errors with no outliers in the data. Rotation 
errors with 33% of outliers (right) 



5.2 Real Sequence Example 

In this section a test consisting of 340 frames is described in detail. More sequences 
can be found in [25]. Small features (fiducials) were placed in known 3D locations 
(see Table 1) on the floor and on the walls of a room (see Fig. 2). Distances were 
measured with a tape having a resolution of 1 millimeter (0.1 cm). The fixed world 
coordinate system was chosen with its origin coinciding with a known junction on the 
floor tiles, the X and Z axes on the floor plane and parallel to the floor tiles and the Y 
axis pointing downwards (with -Y measuring the height above the floor). The balls
are 1.4 cm and the circles 1 cm in diameter; the tiles are squares of 30x30 cm.



Table 1. Scene fiducial geometry

Serial                          World axes location [cm]
number   Type     Color         X       Y        Z
1        Ball     Blue          30      -0.7     180
2        Ball     Green         30      -0.7     210
3        Ball     Yellow        -60     -0.7     240
4        Ball     Light blue    30      -0.7     240
5        Ball     Black         0       -0.7     270
6        Ball     Red           -30     -0.7     330
7        Ball     Orange        60      -0.7     360
8        Circle   Light blue    -31     -100.3   388
9        Circle   Light blue    29      -120.7   492.5






Causal Camera Motion Estimation 







Fig. 2. First frame of the sequence 

5.2.1 Camera Setup and Motion 

The camera was a Panasonic NV-DS60 PAL color camera with a resolution of 
720x576 pixels. The camera zoom was fixed throughout the test at the widest viewing 
angle. A wide field of view reduces the angular accuracy of a pixel, but enables the 
detection of more features (overall, [5] has experimentally found that a wide viewing 
angle is favorable for motion and structure estimation). The camera projection 
parameters at this zoom were obtained from a calibration process: 
x = 938 X/Z + 360.5 ; y = 1004 Y/Z + 288.5 
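
As an illustration, the calibrated projection above can be written as a small function. This is only a sketch: the function name `project` is hypothetical, and the point is assumed to be expressed in the camera frame with Z measured along the optical axis.

```python
def project(X, Y, Z):
    """Project a camera-frame 3D point to pixel coordinates.

    Uses the calibration constants quoted in the text (938, 1004 for the
    focal scaling; 360.5, 288.5 for the image center of a 720x576 frame).
    """
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    x = 938.0 * X / Z + 360.5    # horizontal pixel coordinate
    y = 1004.0 * Y / Z + 288.5   # vertical pixel coordinate
    return x, y


# A point on the optical axis maps to the image center.
print(project(0.0, 0.0, 100.0))  # (360.5, 288.5)
```

Only the ratios X/Z and Y/Z matter, so any length unit can be used as long as it is the same for all three coordinates.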

The camera was initially placed on the floor with the optical axis pointing 
approximately in the Z direction of the world. The camera was moved backwards by 
hand on the floor plane with the final orientation approximately parallel to the initial 
(using the tile lines). The comparison between the robust and the EKF approach is 
made with both having the same motion parameters in the estimated state, the same 
measurements and the same knowledge of the 3D data of table 1. The acceleration 
disturbance parameters for both methods are: a ^ - 0.003 , (7 y = 0.0005 . The number 

of particles is 2000 and the robust distance function parameter is L=4 pixels. 

5.2.2 Feature Tracking 

The features were tracked with a Kanade-Lucas-Tomasi (KLT) type feature tracker 
(see [21]). The tracker was enhanced for color images by minimizing the sum of 
squared errors in all three RGB color channels (the standard KLT is formulated for 
grayscale images). The tracking windows of size 9x9 pixels were initialized in the 
first frame at the center of each ball and circle by manual selection. To avoid the 
damaging effect of interlacing, the vertical image resolution was halved by keeping 
every second row (processing one camera field); the sub-pixel tracking results were 
then scaled back to the full image resolution. 
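
The field extraction and the rescaling of the tracking results can be sketched as follows. The helper names and the `parity` argument (which of the two fields is kept) are assumptions made for illustration; the text only states that every second row was used.

```python
import numpy as np


def extract_field(frame, parity=0):
    """Keep every second row of an interlaced frame, i.e. one camera field."""
    return frame[parity::2]


def field_to_frame_coords(x, y_field, parity=0):
    """Map sub-pixel coordinates measured in a field back to the full frame.

    The horizontal coordinate is unchanged; the vertical one is doubled and
    shifted by the row offset of the chosen field.
    """
    return x, 2.0 * y_field + parity


frame = np.zeros((576, 720, 3))      # PAL frame: 576 rows, 720 columns, RGB
field = extract_field(frame)
print(field.shape)                   # (288, 720, 3)
print(field_to_frame_coords(10.5, 100.25))  # (10.5, 200.5)
```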



5.2.3 Motion Estimation Results 

The results obtained by the proposed robust approach and the EKF approach are 
shown in Fig. 3. Most of the motion is in the Z direction. The final position was at 
approximately Z=-60cm. The robust approach estimates Z=-61cm at the end of the 
motion (there is some uncertainty regarding the exact location of the camera focal 
center); the estimated Y coordinate is almost constant and equal to the camera lens 
center height above the floor (about -7.4 cm). The trajectory estimated by the EKF 
is totally different, with Z ~ -80 cm and Y ~ 30 cm at the end of the motion. 

The deviation from the expected final camera position is two orders of magnitude 
larger than the expected experimental accuracy; the EKF estimate is therefore 
erroneous. After observing the tracking results of all the feature points, points 
1, 2, 4, 5, 6, 8 were manually selected as the inliers (those which reasonably track 
the appropriate object throughout the sequence). Running both estimators again with 
only the inlier points, the results of the proposed approach are almost unchanged, 
while the EKF estimate changes drastically, now producing a trajectory similar to 
that of the robust approach (see Fig. 4). It should be noted that the EKF estimation 
produces a smoother trajectory. Image plane errors between the measurements and the 
projected 3D structure are shown in Fig. 5 (corresponding to the motion estimation 
of Fig. 3). The robust method exhibits low errors for most of the features and allows 
high errors for the outliers (this implies that the algorithm can automatically 
separate the inliers from the outliers by checking the projection errors). The EKF 
approach, on the other hand, exhibits large errors for both inlier and outlier 
features. It should be noted that the outlier features are distracted from the true 
object by its small size, noise, similar objects in the background and reflections 
from the shiny floor. It is possible to improve the feature tracking results by using 
methodologies from [14], [21], [24], but good feature tracking should be complemented 
with a robust methodology in order to compensate for occasional mistakes. Although 
the deficiencies of the EKF approach are mentioned in [5], [6], [7], no examples are 
given and no remedies are suggested in the camera motion estimation literature. As 
anonymous reviewers have suggested, we 

Fig. 3. Estimated camera 3D trajectory using the proposed approach (upper row) and the EKF 
approach (lower row) 











Fig. 4. Estimated camera 3D trajectory using only the inlier points, the proposed approach 
(upper row) and the EKF approach (lower row) 




Fig. 5. Image plane errors. Robust approach showing the 6 inliers (left) and 3 outliers (middle). 
EKF approach with all 9 features (right) 

examined two methods of making the EKF solution more robust: (1) incorporating 
measurements only from features which have a geometric error norm below a 
threshold, and (2) applying the robust distance function to the norm of the 
geometric error of each feature. Both failed to improve the results of the EKF. 
Rejection of outliers in Kalman filtering may succeed if the outliers appear rarely or 
when their proportion is small. In our example these conditions are clearly violated. 
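
The remark that inliers can be separated from outliers by checking the projection errors can be illustrated with a short sketch. The choice of the median over frames as the per-feature statistic, the threshold value and the function name `split_inliers` are all assumptions made for this example; the error values below are synthetic, not measurements from the sequence.

```python
import numpy as np


def split_inliers(errors, threshold):
    """Split features into inliers and outliers by their typical image error.

    errors: array of shape (n_frames, n_features) holding the image-plane
            error norm (in pixels) of each feature in each frame.
    Returns two index arrays: features whose median error is at most
    `threshold`, and the remaining ones.
    """
    typical = np.median(np.asarray(errors, dtype=float), axis=0)
    inliers = np.flatnonzero(typical <= threshold)
    outliers = np.flatnonzero(typical > threshold)
    return inliers, outliers


# Synthetic example: feature 1 drifts away, features 0 and 2 stay close.
errors = np.array([[0.5, 12.0, 1.0],
                   [0.8, 15.0, 0.9],
                   [0.6, 30.0, 1.2]])
inliers, outliers = split_inliers(errors, threshold=4.0)
print(inliers.tolist(), outliers.tolist())  # [0, 2] [1]
```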



5.2.4 Structure Computation Example 

Structure of unknown features in the scene can be recovered using the estimated 
camera motion obtained by the robust method and the image measurements in a 
partially decoupled scheme as explained in section 4.2. As an example, consider the 
middle of the letter B appearing on the air conditioner, which was tracked from frame 
0 to frame 50 (it is occluded shortly afterwards). The reconstructed location of this 
point in the world axes is: X=42.3cm; Y=-45.2cm; Z=159.6cm. The location measured 
with the tape in the world axes is: X=42.0cm; Y=-43.7cm; Z=155cm. The X, Y, Z 
differences are 0.3, 1.5 and 4.6 cm respectively. As expected, the estimation error 
is larger along the optical axis (approximately the world's Z axis). The accuracy is 
reasonable, taking into account the short baseline of 19cm produced during the two 
seconds of tracking this feature (the overall translation from frame 0 to frame 50). 
As discussed in [5], a long baseline improves the structure estimation accuracy when 
the information is properly integrated over time. 
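
The figures quoted above can be checked with a few lines of arithmetic. The frame rate of 25 frames per second is an assumption based on the PAL standard of the camera; it is not stated explicitly in the text.

```python
reconstructed = {"X": 42.3, "Y": -45.2, "Z": 159.6}   # cm, from the estimator
measured = {"X": 42.0, "Y": -43.7, "Z": 155.0}        # cm, from the tape

# Absolute difference per axis, rounded to the 1 mm resolution of the tape.
diff = {axis: round(abs(reconstructed[axis] - measured[axis]), 1) for axis in "XYZ"}
print(diff)  # {'X': 0.3, 'Y': 1.5, 'Z': 4.6}

# Duration of the track: 50 frames at the assumed PAL rate of 25 frames/s.
tracking_seconds = 50 / 25.0
print(tracking_seconds)  # 2.0
```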



6 Conclusion 

A robust framework for camera motion estimation has been presented with extensions 
to the solution of the SLAM problem. The proposed algorithm can tolerate about 33% 
of outliers and is more robust than the commonly used EKF approach. It has been 
shown that a small number of visible features with known 3D structure are enough to 
determine the 3D pose of the camera. This work suggests that some degree of 
decoupling between the motion estimation and structure recovery is a desirable 
property of SLAM algorithms, trading some accuracy loss for increased robustness. 
The robust distance function used in this work is symmetric for all the features, 
with the underlying assumption that the probability of a feature being an inlier or 
an outlier is independent of time. However, in most cases, a feature is expected to 
exhibit a more consistent behavior as an outlier or an inlier. This property may be 
exploited for further improvement of the algorithm's robustness and accuracy. An 
interesting question for future work is how to construct fiducials which can be 
quickly and accurately identified in the scene for camera localization purposes. 



References 



1. A. Azarbayejani and A. Pentland. Recursive Estimation of Motion, Structure and Focal 
Length. IEEE Trans. PAMI, Vol. 17, no. 6, pp. 562-575, 1995. 

2. M. J. Black and P. Anandan. The Robust Estimation of Multiple Motions: Parametric and 
Piece-wise Smooth Flow Fields. CVIU, Vol. 63, No. 1, 1996. 

3. M. J. Black and A. Rangarajan. On the Unification of Line Processes, Outlier Rejection, 
and Robust Statistics with Applications in Early Vision. IJCV 19(1), 57-91, 1996. 

4. T.J. Broida, S. Chandrashekhar, and R. Chellappa. Recursive 3-d motion estimation from a 
monocular image sequence. IEEE Trans. on AES, 26(4):639-656, 1990. 

5. A. Chiuso, P. Favaro, H. Jin, and S. Soatto. Structure from Motion Causally Integrated 
Over Time. IEEE Trans. on PAMI, Vol. 24, No. 4, 2002. 

6. A. J. Davison. Real-Time Simultaneous Localization and Mapping with a Single Camera. 
ICCV 2003. 

7. A. J. Davison and D. W. Murray. Simultaneous Localization and Map-Building using 
Active Vision. PAMI, Vol. 24, No. 7, July 2002. 

8. O. Faugeras. Three Dimensional Computer Vision, a Geometric Viewpoint. MIT Press, 
1993. 

9. R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision, Cambridge 
press 2000. 

10. T.S. Huang and A.N. Netravali. Motion and Structure from Feature Correspondences: a 
review. Proceedings of the IEEE, 82(2):252-268, 1994. 







11. X. Hu and N. Ahuja. Motion and Structure Estimation Using Long Sequence Motion 
Models. Image and Vision Computing, Vol. 11, no. 9, pp. 549-569, 1993. 

12. M. Isard and A. Blake. Visual Tracking by Stochastic Propagation of Conditional Density. 
Proc. 4th ECCV, pages 343-356. 

13. M. Isard and A. Blake. CONDENSATION - Conditional Density Propagation for Visual 
Tracking, Int. J. Computer Vision, 29, 1, 5-28, 1998. 

14. H. Jin, P. Favaro and S. Soatto. Real-time Feature Tracking and Outlier Rejection with 
Changes in Illumination, ICCV, July 2001. 

15. H. Jin. Code from the web site: http://ee.wustl.edu/~hljin/research/ 

16. B.D. Lucas and T. Kanade, An iterative Image Registration Technique with an Application 
to Stereo Vision, In IJCAI81, pages 674-679, 1981. 

17. J. MacCormick and M. Isard, Partitioned Sampling, Articulated Objects and Interface- 
Quality Hand Tracking. Proc. Sixth European Conf. Computer Vision, 2000. 

18. M. Montemerlo and S. Thrun. Simultaneous Localization and Mapping with Unknown 
Data Association using FastSLAM. Proc. ICRA, 2003, to appear 

19. M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM: A Factored Solution to 
the Simultaneous Localization and Mapping Problem. In AAAI-2002. 

20. P.J. Huber. Robust Statistics. Wiley 1981. 

21. J. Shi, C. Tomasi. Good Features to Track. CVPR '94, June 1994, pub. IEEE, pp. 593-600. 

22. M. Spetsakis and J. Aloimonos. A Multi-Frame Approach to Visual Motion Perception. 
Int. J. Computer Vision, Vol. 6, No. 3, pp. 245-255, 1991. 

23. S. Thrun, W. Burgard and D. Fox. A Real-Time Algorithm for Mobile Robot Mapping 
With Applications to Multi-Robot and 3D Mapping. IEEE. International Conference on 
Robotics and Automation. April 2000. 

24. T. Tommasini, A. Fusiello, E. Trucco, and V. Roberto. Making Good Features Track 
Better. Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 178-183, 1998. 

25. Web site: http://www.cs.technion.ac.il/~taln/ 




An Adaptive Window Approach for Image 
Smoothing and Structures Preserving 



Charles Kervrann 



IRISA - INRIA Rennes / INRA - Mathematiques et Informatique Appliquees 
Campus de Beaulieu, 35042 Rennes Cedex, France 

ckervran@irisa.fr 



Abstract. A novel adaptive smoothing approach is proposed for image 
smoothing and discontinuities preservation. The method is based on a 
locally piecewise constant modeling of the image with an adaptive choice 
of a window around each pixel. The adaptive smoothing technique asso- 
ciates with each pixel the weighted sum of data points within the window. 
We describe a statistical method for choosing the optimal window size, 
in a manner that varies at each pixel, with an adaptive choice of weights 
for every pair of pixels in the window. We further investigate how the 
I-divergence could be used to stop the algorithm. It is worth noting the 
proposed technique is data-driven and fully adaptive. Simulation results 
show that our algorithm yields promising smoothing results on a variety 
of real images. 



1 Introduction 

The problem of image recovering is to reduce undesirable distortions and noise 
while preserving important features such as homogeneous regions, discontinu- 
ities, edges and textures. Popular image smoothing algorithms are therefore 
nonlinear to reduce the amount of smoothing near abrupt changes, i.e. edges. 
Most of them are based on discrete [3] or continuous [18,21] energy functionals 
minimization since they are designed to explicitly account for the image geome- 
try. In recent years, a large number of partial differential equations (PDE) and 
variational methods, including anisotropic diffusion [20,4] and total variation 
(TV) minimization [21], have shown impressive results to tackle the problem of 
edge-preserving smoothing [5,4,6,17] and to separate images into noise, texture 
and piecewise smooth components [16,19]. 

In this paper, we also address the adaptive image smoothing problem and 
present a nonparametric estimation method that smooths homogeneous regions 
and inhibits smoothing in the neighborhood of discontinuities. The proposed 
adaptive window approach differs from previous energy minimization-based 
methods [3,18,21]. It is conceptually very simple, being based on the key idea of 
estimating a regression function with an adaptive choice of the window size 
(neighborhood) in which the unknown function is well approximated by a 
constant. At each pixel, we estimate the regression function by iteratively growing 
a window and adaptively weighting input data to achieve an optimal compromise 
between the bias and variance [14,15,13]. The motivation behind this nonparametric 
estimation approach is to use a well-established theory in statistics [10,14] 
for adaptive smoothing, yielding non-iterative algorithms for 2D-3D imaging. 
The complexity of the proposed algorithm is controlled simply by restricting 
the size of the largest window and setting the window growing factor. In contrast 
to most digital diffusion-based filtering processes, for which the input noisy image 
is "abandoned" after the first iteration [20,4], the adaptive window approach 
recycles the original data at each step. Other works related to our approach are 
nonlinear Gaussian filters (iterative or non-iterative bilateral filtering 
[12,22,7,1,23]) that essentially average values within a local window but change 
the weights according to local differences in the intensity. However, these 
weighted schemes use a static window size which can be arbitrarily large, 
in both the spatial and range domains. Our structure-adaptive smoothing also 
works in the joint spatial-range domain but has a more powerful adaptation to 
the local structure of the data, since the size of the window and the internal 
parameters are computed from local image statistics. 

2 A Statistical Nonparametric Approach 

We observe the function u with some additive errors 

    $Y_i = u(x_i) + \varepsilon_i, \quad i = 1, \ldots, n,$    (1) 

where $x_i \in \mathbb{R}^2$ represents the spatial coordinates of the discrete image 
domain S of n pixels and $Y_i \in \mathbb{R}$ is the observed intensity at location 
$x_i$. We suppose the errors $\varepsilon_i$ to be independent zero-mean random 
variables with unknown variances, i.e. $\mathrm{var}(\varepsilon_i) = \sigma_i^2$. 



2.1 Image Model and Basic Idea 



A classical nonparametric estimation approach is based on the structural 
assumption that the regression function u(x) is constant in a local neighborhood 
in the vicinity of a point x. An important question under such an approach is first 
how to determine, for each pixel, the size and shape of the neighborhood concerned 
from the image data. The regression function u(x) can then be estimated 
from the observations lying in the estimated neighborhood of x by a local 
maximum likelihood (ML) method. 

Our procedure is iterative and uses this idea. First, suppose we are given a 
local window $W_i^{(0)}$ containing the point of estimation $x_i$. By $\hat{u}_i^{(0)}$ 
we denote an approximation of $u(x_i)$. We can calculate an initial ML estimate 
$\hat{u}_i^{(0)}$ at point $x_i$ (and its variance $(\hat{\vartheta}_i^{(0)})^2$) by 
averaging observations over a small neighborhood $W_i^{(0)}$ of $x_i$ as 

where $\hat{\sigma}_i^2$ is a pilot estimator which can be plugged in place of the 
noise variance $\sigma_i^2$ and $|W_i^{(0)}|$ denotes the number of points 
$x_j \in W_i^{(0)}$. At the next iteration, a larger neighborhood $W_i^{(1)}$ with 
$W_i^{(0)} \subset W_i^{(1)}$ centered at $x_i$ is considered, and every point $x_j$ 
from $W_i^{(1)}$ gets a weight $w_{ij}^{(1)}$ which is defined by comparing the 
estimates $\hat{u}_i^{(0)}$ and $\hat{u}_j^{(0)}$ obtained at the first iteration. 
Then we recalculate the estimate $\hat{u}_i^{(1)}$ as the weighted average of data 
points lying in the neighborhood $W_i^{(1)}$. We continue this way, growing with k 
the considered neighborhood $W_i^{(k)}$; for each $k \ge 1$, the ML estimate 
$\hat{u}_i^{(k)}$ and its variance are given by: 

where the weights $w_{ij}^{(k)}$ are continuous variables ($0 \le w_{ij}^{(k)} \le 1$), 
computed by comparison of the preceding estimates $\hat{u}_i^{(k-1)}$ and 
$\hat{u}_j^{(k-1)}$. Note we can also write 
$\hat{u}_i^{(k)} = \sum_{j=1}^{n} \mathbf{1}\{x_j \in W_i^{(k)}\}\, w_{ij}^{(k)} Y_j$ 
where $\mathbf{1}\{x_j \in W_i^{(k)}\}$ is the spatial rectangular kernel. In the 
following, we use a spatial rectangular kernel (square windows) for 
mathematical convenience, but the method can be easily extended to the case 
of a more usual spatial Gaussian kernel [12,22,7,1,23]. Moreover, we choose an 
optimal window for each pixel $x_i$ by comparing the new estimate $\hat{u}_i^{(k)}$ 
with the estimate $\hat{u}_i^{(k-1)}$ obtained at the preceding iteration. Finally, 
a global convergence criterion is introduced to stop the estimation procedure. 

At this level, an important connection between our method and local robust 
M-estimation [7] should be mentioned. In Equation (3), the weights do not 
depend on the input data but are only calculated from neighboring local 
estimates, which contributes to the regularization performance. In addition, the 
method is able to recover a piecewise smooth image even though the underlying 
image model is locally constant, as confirmed in our experiments (see Fig. 2). The 
approach is also similar (at least in spirit) to the twofold weighting schemes 
employed in bilateral filtering [22,1,23] and mean shift-based algorithms [8]. 



2.2 Adaptive Weights 

In our approach, we may decide on the basis of the estimates $\hat{u}_i^{(k-1)}$ and 
$\hat{u}_j^{(k-1)}$ whether a point $x_i$ and $x_j \in W_i^{(k)}$ are in the same 
region and then prevent oversmoothing of significant discontinuities. In the local 
Gaussian case, significance is measured using a contrast 
$|\hat{u}_i^{(k-1)} - \hat{u}_j^{(k-1)}|$. If this contrast is high compared 
to $\hat{\vartheta}_i^{(k-1)}$, then $x_j$ is an outlier and should not participate 
in the estimation of $\hat{u}_i^{(k)}$ and $\hat{\vartheta}_i^{(k)}$. 

Hence, motivated by the robustness and smoothing properties of the Huber 
M-estimator in the probabilistic approach of image denoising [2], we introduce 
the following related weight function (but other weight functions are possible 
[4,23]) 

(4) 

Here $\lambda \hat{\vartheta}_i^{(k-1)}$ is related to the spatially varying fraction 
of contamination of the Gaussian distribution: for the majority of points 
$x_j \in W_i^{(k)}$, the values $\hat{u}_i^{(k-1)} - \hat{u}_j^{(k-1)}$ can be 
approximately modeled as being constant (zero) with random Gaussian noise. Hence 
$\lambda$ is an appropriate quantile of the standard normal distribution and depends 
on the level of noise in images. In our experiments, we arbitrarily set 
$\lambda = 3$ according to the well known “rule of 3 sigma”. 

2.3 Optimal Window Selection 



Statistical inference under such a structural assumption focuses on searching, for 
every point $x_i$, the largest neighborhood (window) $W_i$ where the hypothesis of 
structural homogeneity is not rejected. In other words, we seek to estimate a 
regression function u from the data, while having to deal with a so-called nuisance 
parameter, that is the neighborhood. The classical measure of the closeness of 
the estimator $\hat{u}$ obtained in the window $W_i$ to its target value u is the 
mean squared error (MSE), which is decomposed into the sum of the squared bias 
$[\mathrm{Bias}(\hat{u}_i)]^2$ and the variance. 

As explained before, we should choose a window that achieves an optimal 
compromise between the squared bias and variance for each pixel. Accordingly, 
we make the reasonable assumption that the squared bias is an increasing func- 
tion of the neighborhood size and the variance is a decreasing function of the 
neighborhood size. Then, in order to minimize the MSE we search for the win- 
dow where the squared bias and the variance of the estimate are equal. The 
corresponding critical MSE is ($E[\cdot]$ denotes the mathematical expectation): 

Now, let us introduce a finite set of windows $\{W_i^{(0)}, \ldots, W_i^{(k_M)}\}$ 
centered at $x_i \in S$, with $W_i^{(k)} \subset W_i^{(k+1)}$, starting with a small 
$W_i^{(0)}$, and the corresponding estimates $\hat{u}_i^{(k)}$ of the true image 
$u(x_i)$. Denote by $k_i$ the ideal window size corresponding to the minimum value 
of the pointwise MSE at location $x_i$. Accordingly, $W_i^{(k_i)}$ can be obtained 
according to the following statistical pointwise rule [14,15,13]: 



    $\max \{ k : \forall k' < k : \ldots$    (6) 

In other words, as long as successive estimates $\hat{u}_i^{(k)}$ stay close to each 
other, we decide that the bias is small and the size of the estimation window can be 
increased to improve the estimation of the constant model (and to decrease the 
variance of the estimate $\hat{u}_i^{(k)}$). If an estimated point $\hat{u}_i^{(k)}$ 
appears far from the previous ones, we interpret this as the dominance of the bias 
over the variance term. For each pixel, the detection of this transition makes it 
possible to determine the critical size that balances bias and variance. Note that 
the choice of the detection threshold in (6) between $2^2$ and $4^2$ does not change 
the result of the algorithm significantly. 
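
The pointwise control can be sketched in code. This is not a transcription of rule (6), whose exact expression is not reproduced here; it assumes one plausible form, in which a new estimate is accepted while its squared distance to every earlier estimate stays below a threshold times the variance of that earlier estimate. The function name and the default threshold `eta2 = 8` (a value between $2^2$ and $4^2$) are assumptions.

```python
def select_window_index(estimates, variances, eta2=8.0):
    """Return the index of the last window accepted by the pointwise rule.

    estimates[k], variances[k]: estimate and its variance at one pixel for
    window number k (k = 0 is the smallest window).
    Assumed rule: window k is accepted if, for every k' < k,
        (estimates[k] - estimates[k'])**2 <= eta2 * variances[k'].
    The search stops at the first violation.
    """
    k_hat = 0
    for k in range(1, len(estimates)):
        accepted = all(
            (estimates[k] - estimates[kp]) ** 2 <= eta2 * variances[kp]
            for kp in range(k)
        )
        if not accepted:
            break
        k_hat = k
    return k_hat


# The fourth estimate jumps away from the earlier ones, so window 2 is kept.
print(select_window_index([10.0, 10.5, 10.2, 20.0], [4.0, 1.0, 0.25, 0.1]))  # 2
```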



2.4 Global Stopping Rule 

A stopping rule can be used to save computing time if two successive solutions 
are very close and to prevent a useless setting of the larger window size. Here, we 
adopt the so-called Csiszár's I-divergence [9] to detect global convergence: 

    $I(\hat{u}^{(k)}, \hat{u}^{(k+1)}) = \sum_{i=1}^{n} \left[ \hat{u}_i^{(k)} \log \frac{\hat{u}_i^{(k)}}{\hat{u}_i^{(k+1)}} - \hat{u}_i^{(k)} + \hat{u}_i^{(k+1)} \right]$    (7) 

We choose this criterion to obtain the distance between succeeding iterations 
since the decorrelation criterion proposed in [17] tends to underestimate the 
number of necessary iterations in our framework. In addition, the I-divergence 
criterion can be used for a variety of algorithms, as it does not directly depend 
on the restoration method. In practice, the I-divergence is normalized with its 
maximal occurring value at iteration k = 0. When $I(\hat{u}^{(k)}, \hat{u}^{(k+1)})$ 
sinks under a threshold (of the order $10^{-3}$ for typical images) that sufficiently 
represents convergence, the algorithm is stopped at the final iteration $k_c = k$, 
with $k_c \le k_M$. 

Finally, the window size increases at each iteration k if the global convergence 
criterion is not met (or $k < k_M$), without changing the estimate if the pointwise 
rule (6) has already been violated at $x_i$, i.e. $\hat{u}_i^{(k)} = \hat{u}_i^{(k_i)}$ 
if $k_i < k$. If the rule (6) has not been violated at $x_i$, we have $k_i = k$ 
where k is the current iteration of the algorithm. 
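
A minimal sketch of the global stopping test follows. It uses the standard definition of Csiszár's I-divergence for nonnegative arrays, $I(u, v) = \sum_i [u_i \log(u_i / v_i) - u_i + v_i]$; the small constant `eps` guarding against zero intensities and the helper names are assumptions of this example.

```python
import numpy as np


def i_divergence(u, v, eps=1e-12):
    """Csiszar's I-divergence between two nonnegative images u and v."""
    u = np.maximum(np.asarray(u, dtype=float), eps)
    v = np.maximum(np.asarray(v, dtype=float), eps)
    return float(np.sum(u * np.log(u / v) - u + v))


def has_converged(i_current, i_first, tol=1e-3):
    """Compare the divergence, normalized by its value at k = 0, with tol."""
    if i_first <= 0.0:
        return True  # the first two iterates were already identical
    return i_current / i_first < tol


print(i_divergence([1.0, 2.0], [1.0, 2.0]))  # 0.0
print(has_converged(0.0004, 1.0))            # True
```

Because the test only compares two successive images, the same helper can be reused with other iterative restoration schemes, as noted in the text.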

3 Implementation 

The key ingredient of the procedure is an increasing sequence of neighborhoods 
$W_i^{(k)}$, $k = 0, 1, \ldots, k_M$, with $W_i^{(k)} \subset W_i^{(k+1)}$ centered 
at each image pixel $x_i$. In what follows, $|W_i^{(k)}|$ denotes the number of 
points $x_j$ in $W_i^{(k)}$, i.e. $|W_i^{(k)}| = \#\{x_j \in W_i^{(k)}\}$. In our 
experiments, we arbitrarily use neighborhoods $W_i^{(k)}$ corresponding to 
successive square windows of size $|W_i^{(k)}| = (2k + 1) \times (2k + 1)$ 
pixels with $k = 0, 1, 2, \ldots, k_M$. 

The estimation procedure described in Section 2.1 also relies on the preliminary 
local estimation of the noise variance. This estimation is an (off-line) 
pre-processing step to initialize the adaptive smoothing procedure. In most 
applications, the noise variance $\sigma_i^2$ at each image point $x_i$ is unknown 
and an estimate $\hat{\sigma}_i^2$ can be obtained from data as 

    $\hat{\sigma}_i^2 = \frac{1}{|B_i|} \sum_{x_j \in B_i} \varepsilon_j^2$    (8) 

where $B_i$ denotes a square block of pixels centered at $x_i$ and the local 
residuals $\varepsilon_j$ are defined as (we note $Y_{j_1,j_2}$ the observation 
$Y_j$ at site $j = (j_1, j_2)$) [10]: 

    $\varepsilon_j = \frac{4 Y_{j_1,j_2} - (Y_{j_1+1,j_2} + Y_{j_1-1,j_2} + Y_{j_1,j_2+1} + Y_{j_1,j_2-1})}{\sqrt{20}}$    (9) 

In the presence of discontinuities in the block $B_i$, a local estimate of the noise 
variance based on robust statistics is preferable. In this framework, high 
discontinuities within the region $B_i$ correspond to statistical outliers with 
respect to local image contrasts. As in [4], we suggest to define 
$\hat{\sigma}_i^2 = \max(\hat{\sigma}^2, \hat{\sigma}_{B_i}^2)$ where 
$\hat{\sigma}^2$ and $\hat{\sigma}_{B_i}^2$ are respectively the robust estimates of 
the noise variance computed for the entire image and within a local block centered 
at point $x_i$. We propose the following robust estimates for $\hat{\sigma}$ and 
$\hat{\sigma}_{B_i}$: 

    $\hat{\sigma} = 1.4826 \; \mathrm{median}(|\, |\varepsilon_S| - \mathrm{median}|\varepsilon_S| \,|)$    (10) 

    $\hat{\sigma}_{B_i} = 1.4826 \; \mathrm{median}(|\, |\varepsilon_{B_i}| - \mathrm{median}|\varepsilon_{B_i}| \,|)$    (11) 

where $\varepsilon_S = \{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n\}$ is 
the set of n local residuals of the entire image, $\varepsilon_{B_i}$ is the set of 
$|B_i|$ local residuals within the local block $B_i$ and the constant is derived 
from the fact that the median absolute deviation of a zero-mean normal distribution 
with unit variance is 0.6745 = 1/1.4826. The local estimation of the noise variance 
is useful for filtering out spatially varying textures. The global estimate of the 
noise variance also provides a reasonable lower bound and prevents the amplification 
of noise in relatively homogeneous areas. From a practical point of view, we have 
explored the computation of the local estimate of the noise variance within blocks 
$B_i$ of size $(2k_M + 1) \times (2k_M + 1)$, i.e. $B_i = W_i^{(k_M)}$, at every 
location $x_i$ in the image. 
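
The robust noise estimation can be sketched with NumPy. The residual below is the four-neighbour difference normalized by $\sqrt{20}$, which is our reading of Equation (9); border pixels are simply skipped, and the function names are hypothetical.

```python
import numpy as np


def pseudo_residuals(image):
    """Local residuals: 4 * center minus the four neighbours, over sqrt(20).

    Only interior pixels are returned, so the result has shape (h-2, w-2).
    """
    y = np.asarray(image, dtype=float)
    center = y[1:-1, 1:-1]
    neighbours = y[2:, 1:-1] + y[:-2, 1:-1] + y[1:-1, 2:] + y[1:-1, :-2]
    return (4.0 * center - neighbours) / np.sqrt(20.0)


def robust_sigma(residuals):
    """1.4826 * median(| |e| - median|e| |), as in Equations (10) and (11)."""
    a = np.abs(np.asarray(residuals, dtype=float)).ravel()
    return 1.4826 * float(np.median(np.abs(a - np.median(a))))


def combined_variance(sigma_global, sigma_local):
    """Noise variance used at a pixel: the larger of the two robust estimates."""
    return max(sigma_global ** 2, sigma_local ** 2)


# A linear ramp has zero residuals, hence a zero noise estimate.
ramp = np.add.outer(2.0 * np.arange(6), 3.0 * np.arange(7))
print(robust_sigma(pseudo_residuals(ramp)))  # 0.0
```

To obtain the local estimate for a block, `robust_sigma` would be applied to the residuals falling inside that block only.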

Below, we give the proposed algorithm: 

- Initialization: For each point $x_i$, we calculate initial estimates 
$\hat{u}_i^{(0)}$ and $\hat{\vartheta}_i^{(0)}$ using Equation (2) and set k = 1. We 
naturally choose $|W_i^{(0)}| = 1$, i.e. the initial local neighborhood $W_i^{(0)}$ 
contains only $x_i$. Here $\hat{\sigma}_i^2$ is the noise variance robustly estimated 
at each point $x_i$ from data as explained before. 

- Estimation: For all $x_j$ in $W_i^{(k)}$, we compute weights $w_{ij}^{(k)}$ using 
Equation (4) and new estimates $\hat{u}_i^{(k)}$ and $\hat{\vartheta}_i^{(k)}$ using 
Equation (3). 

- Pointwise control: After the estimate $\hat{u}_i^{(k)}$ has been computed, we 
compare it to the previous estimates $\hat{u}_i^{(k')}$ at the same point $x_i$ for 
all $k' < k$. If the pointwise rule (6) is violated at iteration k, we do not accept 
$\hat{u}_i^{(k)}$ and keep the estimate $\hat{u}_i^{(k-1)}$ from the preceding 
iteration as the final estimate at $x_i$ (i.e. $k_i = k - 1$ at point $x_i$). The 
estimate at $x_i$ is unchanged if $k > k_i$. 

- Convergence: Stop the procedure if $k = k_M$ or if 
$I(\hat{u}^{(k)}, \hat{u}^{(k+1)}) < 10^{-3}$, otherwise increase k by 1 and continue 
with the estimation step. We use the parameter $k_M$ to bound the numerical 
complexity. As expected, increasing $k_M$ allows for additional variance reduction 
in homogeneous regions but usually does not change the estimates in the neighborhood 
of discontinuities. In our experiments, $k_M = 15$ provides a good compromise and 
over-estimates the number of necessary iterations. 
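
The steps above can be sketched for a one-dimensional signal. This is a simplified illustration under stated assumptions, not the author's implementation: a single known noise level `sigma` replaces the local estimates, the weight is an assumed Huber-like function (1 below `lam` times the standard deviation of the estimate, decaying above it), the pointwise rule is the assumed squared-distance test with threshold `eta2`, and the global I-divergence test is omitted so that the loop always runs to `k_max`.

```python
import numpy as np


def adaptive_window_1d(y, sigma, k_max=15, lam=3.0, eta2=8.0):
    """Grow a window around each sample, reweight, and stop sample by sample."""
    y = np.asarray(y, dtype=float)
    n = y.size
    u = y.copy()                          # k = 0: the window holds one sample
    var = np.full(n, float(sigma) ** 2)   # variance of a single observation
    past = [(u.copy(), var.copy())]       # estimates kept at earlier iterations
    frozen = np.zeros(n, dtype=bool)      # True once the rule has been violated
    for k in range(1, k_max + 1):
        u_prev, var_prev = u.copy(), var.copy()
        for i in np.flatnonzero(~frozen):
            lo, hi = max(0, i - k), min(n, i + k + 1)
            contrast = np.abs(u_prev[lo:hi] - u_prev[i])
            scale = lam * np.sqrt(var_prev[i])
            # Assumed weight: 1 for small contrasts, scale/contrast otherwise.
            g = np.where(contrast <= scale, 1.0,
                         scale / np.maximum(contrast, 1e-12))
            w = g / g.sum()               # the center always contributes g = 1
            cand_u = float(w @ y[lo:hi])  # weighted mean of the ORIGINAL data
            cand_var = float(sigma) ** 2 * float(w @ w)
            # Assumed pointwise control against all earlier estimates.
            if all((cand_u - pu[i]) ** 2 <= eta2 * pv[i] for pu, pv in past):
                u[i], var[i] = cand_u, cand_var
            else:
                frozen[i] = True          # keep the previous estimate for good
        past.append((u.copy(), var.copy()))
    return u


step = np.concatenate([np.zeros(10), np.full(10, 100.0)])
smoothed = adaptive_window_1d(step, sigma=1.0, k_max=5)
print(smoothed.shape)  # (20,)
```

Each estimate is a convex combination of the input samples, so the output of this sketch stays within the range of the input, in line with the extremum principle stated for the method.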

Using this algorithm, it can be shown that the average gray level of 
the smoothed image is not affected by the adaptive window procedure, i.e. 
$\frac{1}{n} \sum_{i=1}^{n} \hat{u}_i = \frac{1}{n} \sum_{i=1}^{n} Y_i$, and the 
extremum principle is guaranteed: $\min_j Y_j \le \hat{u}_i^{(k)} \le \max_j Y_j$, 
$\forall x_i \in S$, $\forall k \le k_M$. 



4 Image Decomposition 

The proposed adaptive window approach is able to extract a piecewise smooth 
image, but significant textures and oscillating patterns are actually excluded. 
The purpose of this section is to briefly show that a relatively simple modification 
of the above global procedure yields an algorithm able to separate images into 
noise, texture and piecewise smooth (or cartoon) parts. In general the image 
signal is assumed to be corrupted by an additive zero-mean white Gaussian noise 
with a constant variance. Therefore, the key idea is first to remove noise from the 
input image by setting $\hat{\sigma}_i = \hat{\sigma}$ in the estimation procedure 
described in Section 3.1. If the original image consists of three additive components 
(noise, texture, cartoon) [16,19,11], the texture component is simply obtained by 
computing the difference between the noise-free image and the piecewise smooth 
image. The piecewise smooth image is estimated as described earlier in the paper, 
i.e. by considering local estimations of the noise-texture variances 
$\hat{\sigma}_i^2$ in the procedure. In areas containing little texture this 
difference is close to zero since $\hat{\sigma}_{B_i}^2$ is likely to be less than 
$\hat{\sigma}^2$ in these areas. While the experimental results are promising 
using this simple mechanism, we do not claim that it is able to decompose any 
image into the three main components under concern ([16,19,11]). 
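
The bookkeeping of the decomposition can be sketched as follows, assuming the two smoothing passes have already produced a noise-free image and a piecewise smooth (cartoon) image. The function name and the toy arrays are hypothetical.

```python
import numpy as np


def decompose(noisy, noise_free, cartoon):
    """Return (noise, texture) so that noisy = cartoon + texture + noise."""
    noisy = np.asarray(noisy, dtype=float)
    noise_free = np.asarray(noise_free, dtype=float)
    cartoon = np.asarray(cartoon, dtype=float)
    noise = noisy - noise_free      # removed by the pass with the global sigma
    texture = noise_free - cartoon  # removed only by the pass with local sigmas
    return noise, texture


# Toy example with known components.
cartoon = np.array([[10.0, 10.0], [50.0, 50.0]])
texture = np.array([[2.0, -2.0], [0.0, 0.0]])
noise = np.array([[0.5, -0.5], [0.25, -0.25]])
est_noise, est_texture = decompose(cartoon + texture + noise,
                                   cartoon + texture, cartoon)
print(np.allclose(est_noise, noise), np.allclose(est_texture, texture))  # True True
```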

5 Experiments 

The potential of the adaptive window method is first shown on a synthetic-natural 
image (Fig. 1a) corrupted by an additive white Gaussian noise (Fig. 1b, 
PSNR = 28.3 dB, $\sigma = 10$). In Fig. 1c, the noise and small textures are 
reduced in a natural manner and significant geometric features such as corners 
and boundaries, and original contrasts are visually well preserved (PSNR = 31 
dB). In this experiment, the final iteration $k_c = 11$ was determined autonomously 
by the I-divergence criterion ($k_M = 15$). To enhance the visibility of the results, 

Fig. 1. An example of adaptive smoothing results. 




Fig. 2. A horizontal cross-section corresponding to line 128 drawn in white in Fig. 1a. 



we have computed the gradient norm on each of the three images; in Figs. 1d-f, 
high gradient values are coded with black and zero gradient with white. To better 
appreciate the smoothing process, a horizontal cross-section marked in white 
in Fig. 1a is graphically depicted in Fig. 2. The abrupt changes are correctly 
located and satisfactorily smooth variations of the signal are recovered (Fig. 2b). 
The processing of a 256 x 256 image typically required 15 seconds on a PC (2.6 
GHz, Pentium IV) using a standard C++ implementation of the algorithm. 

(e) part of the cartoon-like image using our method 
(f) part of the cartoon-like image by TV minimization [21] 

Fig. 3. Processing of the noisy 512 x 512 Barbara image. 

(a) noisy signal (b) cartoon component (c) texture component 

Fig. 4. A horizontal cross-section corresponding to line 256 drawn in Fig. 3a. 
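
As a side check on the PSNR figures quoted in the experiments, the PSNR of an 8-bit image degraded only by additive noise of standard deviation $\sigma$ is $20 \log_{10}(255 / \sigma)$. The short sketch below evaluates it for $\sigma = 10$; it assumes a peak value of 255 and a mean squared error exactly equal to $\sigma^2$, so a small gap to a value measured on an actual noisy image (for instance after clipping to the valid intensity range) is to be expected.

```python
import math


def psnr_from_noise_std(sigma, peak=255.0):
    """PSNR in dB when the mean squared error equals sigma squared."""
    return 20.0 * math.log10(peak / sigma)


print(round(psnr_from_noise_std(10.0), 2))  # 28.13
```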



In a second example, the effects of the adaptive window approach are illustrated 
using the well-known 512 x 512 Barbara image (Fig. 3). It turns out that 
the results of the adaptive window approach and TV minimization are visually 
similar when they are applied to the original image corrupted by an additive 
white Gaussian noise (Fig. 3a, $\sigma = 10$). In this experiment, the noise-free 
image shown in Fig. 3b has been obtained by setting $\hat{\sigma}_i = \hat{\sigma}$ 
in the adaptive window procedure. Additionally, our method is also able to reliably 
estimate the piecewise smooth component as shown in Fig. 3c and Fig. 3e. Small 
textures on clothes in the original image are nearly eliminated after $k_c = 11$ 
iterations (automatically detected). In Fig. 3f, the TV minimizing process [21] does 
not completely eliminate small textures without blurring edges even if the balance 
between the fidelity and regularizing terms is modified. Finally, the estimated 
texture component shown in Fig. 3d corresponds to the difference between the 
noise-free image (Fig. 3b) and the piecewise smooth image (Fig. 3c). In Fig. 
4, the horizontal cross-section marked in white in Fig. 3a is depicted to better 
appreciate the image decomposition. The method also provides some additional 
structural information about the image. Figure 5a shows the results of local 
estimations of the noise-texture variance $\hat{\sigma}_i^2$ within local windows 
($k_M = 15$). Dark areas have higher values of $\hat{\sigma}_i^2$ and correspond to 
more textured image regions; bright areas correspond to uniform regions, i.e. 
$\hat{\sigma}_i^2 = \hat{\sigma}^2$. Figure 5b shows the locations and sizes of 
optimal estimation windows; we have coded small windows with black and large windows 
with white. As expected, small windows are in the neighborhood of image gradients 
shown in Fig. 5e. The histogram of the window sizes is also shown in Fig. 5f. 
Finally, Figures 5c and 5d show respectively the 

values {|W^^| -1 9ij^} an d } where dark values correspond to 

image discontinuities (Fig. 5c) and to low-confidence estimates (Fig. 5d). 

We have also tested the algorithm (also implemented for processing 3D data)
on a 3D confocal fluorescence microscopy image that contains complex structures.
A typical 2D image taken from the 3D stack of 80 images depicts glomeruli in
the antennal lobes of the moth olfactory system (Fig. 6a). The smoothed 3D
image has larger homogeneous areas than the original 3D image (Fig. 6b). The
corresponding noise-free texture image is shown in Fig. 6c. Here, this image
decomposition is useful to extract regions/volumes of interest.
(a) local estimations of noise and texture variance (b) locations/sizes of windows (c) image discontinuities (see text)
(d) variances (e) gradient of the cartoon image (f) histogram of the window sizes

Fig. 5. Results of the cartoon-like 512 x 512 Barbara image. 




(a) original image (b) cartoon image (c) noise-free texture component 

Fig. 6. Decomposition of a 3D confocal microscopy image. 



6 Summary and Conclusions 

We have described a novel feature-preserving adaptive smoothing algorithm
where local statistics and variable window sizes are jointly used. Since $|W^{(k)}|$
grows exponentially in our set-up, the overall complexity of the proposed algorithm
is of order $O(n\,|W^{(k_c)}|)$ if an image contains $n$ pixels and $k_c \le k_m$. In
addition, the proposed smoothing scheme provides an alternative method to the
anisotropic diffusion and bilateral filtering or energy minimization methods. An
advantage of the method is that internal parameters can be easily calibrated using
statistical arguments. Experimental results demonstrate its potential
for image decomposition into noise, texture and piecewise smooth components.
Abstract. The exploitation of video data requires extracting information
at a rather semantic level, and thus methods able to infer “concepts”
from low-level video features. We adopt a statistical approach
and we focus on motion information. Because of the diversity of dynamic
video content (even for a given type of event), we have to design
appropriate motion models and learn them from videos. We have defined
original and parsimonious probabilistic motion models, both for
the dominant image motion (camera motion) and the residual image
motion (scene motion). These models are learnt off-line. Motion measurements
include affine motion models to capture the camera motion,
and local motion features for scene motion. The two-step event detection
scheme consists in pre-selecting the video segments of potential interest,
and then in recognizing the specified events among the pre-selected segments,
the recognition being stated as a classification problem. We report
accurate results on several sports videos.



1 Introduction and Related Work 

Exploiting the tremendous amount of multimedia data, and specifically video
data, requires developing methods able to extract information at a rather semantic
level. Video summarization, video retrieval and video surveillance are examples
of applications. Inferring concepts from low-level video features is a highly
challenging problem. The characteristics of a semantic event have to be expressed
in terms of video primitives (color, texture, motion, shape ...) that are sufficiently
discriminant w.r.t. content. This remains an open problem at the source of active
research activities.

In [9], statistical models for components of the video structure are introduced 
to classify video sequences into different genres. The analysis of image motion 
is widely exploited for the segmentation of videos into meaningful units or for 
event recognition. Efficient motion characterization can be derived from the op- 
tical flow, as in [8] for human action change detection. In [11], the authors use 
very simple local spatio-temporal measurements, i.e., histograms of the spatial 
and temporal intensity gradients, to cluster temporal dynamic events. In [10], a
principal component representation of activity parameters (such as translation,
rotation ...) learnt from a set of examples is introduced. The considered application
was the recognition of particular human motions, assuming an initial
segmentation of the body.

In [2], video abstraction relies on a measure of fidelity of a set of key-frames based
on color descriptors and a measure of summarizability derived from MPEG-7
descriptors. In [6], spatio-temporal slices extracted in the volume formed by the
image sequence are exploited both for clustering and retrieving video shots. Sport
videos are receiving specific attention due to the economic importance of sport
TV programs and to future services to be designed in that context. Different
approaches have been recently investigated to detect highlights in sport videos.
Dominant colour information is used in [3].

In this paper, we tackle the problem of inferring concepts from low-level video 
features and we follow a statistical approach involving modeling, (supervised) 
learning and classification issues. Such an attempt was recently undertaken for 
static images in [5]. We are dealing here with concepts related to events in videos, 
more precisely, to dynamic content. Therefore, we focus on motion information. 
Since no analytical motion models are available to account for the diversity of 
dynamic contents to be found in videos, we have to specify and learn them from 
the image data. To this end, we introduce new probabilistic motion models. Such 
a probabilistic modelling allows us to derive a parsimonious motion representation
while coping with errors in the motion measurements and with variability
in motion appearance for a given type of event. We handle in a distinct way the
scene motion (i.e., the residual image motion) and the camera motion (i.e., the
dominant image motion) since these two sources of motion bring important and
complementary information. As for motion measurements, we consider, on the one
hand, parametric motion models to capture the camera motion, and on the other
hand, local motion features to account for the scene motion. They convey more
information than those used in [11], while remaining easy to compute, unlike
optical flow. They can be efficiently and reliably computed in any video whatever
its genre and its content.

We have designed a two-step event detection method to restrict the recognition 
issue to a limited and pertinent set of classes since probabilistic motion models 
have to be learnt for each class of event to be recognized. This allows us to 
simplify the learning stage, to save computation time and to make the overall 
detection more robust and efficient. The first step consists in selecting candidate 
segments of potential interest in the processed video. Typically, for sport videos,
it involves selecting the “play” segments. The second step handles the recognition
of the relevant events (in terms of dynamic content) among the segments
selected after the first step and is stated as a classification problem. 

The remainder of the paper is organized as follows. In Section 2, we briefly 
present the motion measurements we use. Section 3 is concerned with the prob- 
abilistic models introduced to represent the dominant image motion and the 
residual motion. We describe in Section 4 the two-step event detection method, 
while the learning stage is detailed in Section 5. Experiments on sports videos 
are reported in Section 6, and Section 7 contains concluding remarks. 
2 Motion Measurements



Let us first briefly describe the motion measurements that we use. Let us point 
out that the choice of these measurements is motivated by the goal we are 
pursuing, that is the recognition of important events in videos. This task is 
intended as a rather qualitative characterization which does not require a full 
estimation of the image motion. 

It is possible to characterize the image motion as proposed in [4], by computing
at each pixel a local weighted mean of the normal flow magnitude. However, the
image motion is actually the sum of two motion sources: the dominant motion
(which can usually be assumed to be due to camera motion) and the residual
motion (which is then related to the independent moving objects in the scene, 
which will be referred to as the scene motion in the sequel). More information 
can be recovered if we explicitly consider these two motion components rather 
than the total motion only. Thus, we first compute the camera motion (more 
precisely, we estimate the dominant image motion) between successive images of 
the sequence. Then, we cancel the camera motion (i.e., we compensate for the 
estimated dominant image motion), which allows us to compute local motion- 
related measurements revealing the residual image motion only. 

The dominant image motion is represented by a deterministic 2D affine motion 
model which is a usual choice: 



$$w_{\theta}(p) = \begin{pmatrix} a_1 + a_2 x + a_3 y \\ a_4 + a_5 x + a_6 y \end{pmatrix} \qquad (1)$$



where $\theta = (a_i, i = 1, \dots, 6)$ is the model parameter vector and $p = (x, y)$ is an
image point. This simple motion model can correctly handle different camera
motions such as panning, zooming and tracking (including, of course, static shots).
Different methods are available to estimate such a motion model. We use the
robust real-time multiresolution algorithm described in [7]. Let us point out that
the motion model parameters are directly computed from the spatio-temporal
derivatives of the intensity function. Thus, the camera-motion flow vector $w_{\hat{\theta}_t}(p)$
is available at each time $t$ and for each pixel $p$.
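As a minimal sketch of how the affine model (1) yields a flow vector at every grid point, the following Python fragment evaluates it for two hand-picked parameter vectors. The function names, the example parameter values and the choice of the image origin as coordinate origin are assumptions made for illustration; the robust estimation of $\theta$ described in [7] is not reproduced here.

```python
def affine_flow(theta, x, y):
    """Flow vector (u, v) of the 2D affine model (1) at image point (x, y)."""
    a1, a2, a3, a4, a5, a6 = theta
    return (a1 + a2 * x + a3 * y, a4 + a5 * x + a6 * y)


def flow_field(theta, width, height):
    """Evaluate the model at every point of a width x height image grid."""
    return [[affine_flow(theta, x, y) for x in range(width)]
            for y in range(height)]


# Hypothetical parameter vectors (not taken from the paper):
pan = (2.0, 0.0, 0.0, 0.0, 0.0, 0.0)    # constant horizontal translation
zoom = (0.0, 0.1, 0.0, 0.0, 0.0, 0.1)   # divergence around the origin

field = flow_field(pan, width=4, height=3)
```

With the `pan` vector every grid point receives the same flow, whereas with `zoom` the flow grows linearly with the distance to the origin, which is how a single six-parameter model covers panning, zooming and static shots.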

Then, the residual motion measurement $v_{res}(p,t)$ is defined as the local mean
of the magnitude of normal residual flows weighted by the square of the norm
of the spatial intensity gradient. The normal residual flow magnitude is given
by the absolute value of the Displaced Frame Difference $DFD_{\hat{\theta}_t}$, evaluated with
the estimated dominant motion, and divided by the norm of the image spatial
gradient. We finally get:

$$v_{res}(p,t) = \frac{\sum_{q \in \mathcal{F}(p)} \|\nabla I(q,t)\| \, |DFD_{\hat{\theta}_t}(q)|}{\max\left(\eta^2, \sum_{q \in \mathcal{F}(p)} \|\nabla I(q,t)\|^2\right)} \qquad (2)$$

where $DFD_{\hat{\theta}_t}(q) = I(q + w_{\hat{\theta}_t}(q), t+1) - I(q,t)$, $\mathcal{F}(p)$ is a local spatial window
centered on pixel $p$ (typically a 3 x 3 window), $\nabla I(q,t)$ is the spatial intensity
gradient at pixel $q$ at time $t$, and $\eta^2$ is a predetermined constant related to the noise
level.

Fig. 1. Athletics video: First row: four images of the video. Second row: the corresponding
maps of dominant image motion supports (inliers in white, outliers in black). Third
row: local residual motion measurements $v_{res}$ (zero value in black).

Such measurements have already been used for instance for the detection
of independent moving objects in case of a mobile camera. Figure 1 displays, respectively,
images of an athletics TV program, the corresponding maps of dominant
motion support (i.e., the points belonging to the image parts undergoing the 
estimated dominant motion) and the corresponding maps of residual motion 
measurements. This example shows that the camera motion is reliably captured 
even in case of multiple moving elements in the scene since the static background 
is correctly included in the inliers. It also indicates that the scene motion is 
correctly accounted for by the residual motion measurements. From relation (2), 
it can be straightforwardly noted that we only get information related to motion 
magnitude, and consequently, we lose the motion direction. As demonstrated by 
the results reported later, this is not a shortcoming since we aim at detecting 
events, i.e., at determining “qualitative” motion classes. Furthermore, it allows 
us to manipulate scalar measurements. 
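The weighted local mean of relation (2) can be sketched as follows. This is an illustrative pure-Python version that assumes the gradient norms and the displaced frame differences have already been computed per pixel; the function name, the clipping of the window at image borders and the default value of the noise constant are assumptions, not details given in the text.

```python
def residual_motion(grad_norm, dfd, eta2=1.0, radius=1):
    """Local residual motion measurement per pixel, following relation (2).

    grad_norm[y][x]: norm of the spatial intensity gradient at each pixel.
    dfd[y][x]:       displaced frame difference after compensating the
                     estimated dominant motion.
    eta2:            noise-related constant (value assumed here).
    radius:          1 gives the 3 x 3 window mentioned in the text.
    """
    h, w = len(grad_norm), len(grad_norm[0])
    v_res = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            num = den = 0.0
            # Window clipped at the image borders (an assumption).
            for qy in range(max(0, y - radius), min(h, y + radius + 1)):
                for qx in range(max(0, x - radius), min(w, x + radius + 1)):
                    g = grad_norm[qy][qx]
                    num += g * abs(dfd[qy][qx])   # ||grad I|| * |DFD|
                    den += g * g                  # ||grad I||^2
            v_res[y][x] = num / max(eta2, den)
    return v_res
```

The `max(eta2, den)` denominator is what keeps the measurement at zero in uniform areas, where the gradient vanishes and the normal flow is undefined.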



3 Probabilistic Modelling of Motion 

The proposed method for the detection of important dynamic events relies on 
the probabilistic modelling of the motion content in a video. Indeed, the large 
diversity of video contents leads us to favor a probabilistic approach which more- 
over allows us to formulate the problem of event recognition within a Bayesian 
framework. Due to the different nature of the information brought by the residual
motion (scene motion) and by the dominant motion (camera motion), two
different probabilistic models are defined.
3.1 Residual Motion

We first describe the probabilistic model of scene motion derived from statistics
on the local residual motion measurements expressed by relation (2). The
histograms of these measurements computed over different video segments were
found to be similar to a zero-mean Gaussian distribution (a truncated version
since we deal with non-negative values only: by definition $v_{res}(p,t) \ge 0$), except for a
usually prominent peak at zero. Therefore, we model the distribution of the local
residual motion measurements within a video segment by a specific mixture
model involving a truncated Gaussian distribution and a Dirac distribution. It
can be written as:

$$P_{v_{res}}(\gamma) = \beta\,\delta_0(\gamma) + (1-\beta)\,\phi_t(\gamma; 0, \sigma^2)\,\mathbb{1}_{\gamma \neq 0}(\gamma), \qquad (3)$$

where $\beta$ is the mixture weight, $\delta_0$ denotes the Dirac function at 0 ($\delta_0(\gamma) = 1$ if
$\gamma = 0$ and $\delta_0(\gamma) = 0$ otherwise) and $\phi_t(\gamma; 0, \sigma^2)$ denotes the truncated Gaussian
density function with mean 0 and variance $\sigma^2$. $\mathbb{1}$ denotes the indicator function
($\mathbb{1}_{\gamma \neq 0} = 1$ if $\gamma \neq 0$ and $\mathbb{1}_{\gamma \neq 0} = 0$ otherwise). Parameters $\beta$ and $\sigma^2$ are estimated
using the Maximum Likelihood (ML) criterion. In order to capture not only the
instantaneous motion information but also its temporal evolution over the video
segment, the temporal contrasts $\Delta v_{res}$ of the local residual motion measurements
are also considered: $\Delta v_{res}(p,t) = v_{res}(p,t+1) - v_{res}(p,t)$. They are also
modeled by a mixture model of a Dirac function at 0 and a zero-mean Gaussian
distribution, but the Gaussian distribution is not truncated here. The mixture
weight and the variance of the Gaussian distribution are again evaluated using
the ML criterion.
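For this particular mixture the ML estimates have a closed form: the weight of the Dirac component is the fraction of zero measurements, and the variance of a zero-mean Gaussian (truncated to positive values or not) is the mean of the squared non-zero measurements. The sketch below assumes the measurements of a segment are pooled into flat lists and that "zero" means exactly zero; the function names are hypothetical.

```python
def fit_mixture(values):
    """ML estimates (beta, sigma2) of the Dirac / zero-mean Gaussian mixture."""
    n = len(values)
    nonzero = [v for v in values if v != 0]
    beta = (n - len(nonzero)) / n          # share of exactly-zero measurements
    # Mean of squares of the non-zero values; valid for both the truncated
    # (v > 0) and the non-truncated zero-mean Gaussian component.
    sigma2 = sum(v * v for v in nonzero) / len(nonzero) if nonzero else 0.0
    return beta, sigma2


def fit_residual_model(v_res, dv_res):
    """Four parameters of a segment: (beta_v, sigma2_v, beta_dv, sigma2_dv)."""
    return fit_mixture(v_res) + fit_mixture(dv_res)


# Toy measurements, invented for illustration only.
model = fit_residual_model([0, 0, 1.0, 2.0, 0, 3.0], [0, -1.0, 1.0, 0])
```

Because both estimates only need a count of zeros and a sum of squares, they can be accumulated in a single pass over the measurements of a video segment.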

The full probabilistic residual motion model is then defined as the product of
these two models as follows: $P_{M_{res}}(v_{res}, \Delta v_{res}) = P(v_{res}) \cdot P(\Delta v_{res})$. The
probabilistic residual motion model is completely specified by four parameters only,
which are moreover easily computable. Obviously, this model does not allow
us to capture how the motion information is spatially distributed in the image
plane, but this is not necessary for the objective we consider here.



3.2 Dominant Image Motion 

We have to design a probabilistic model of the camera motion to combine it
with the probabilistic model of the residual motion in the recognition process. A
first choice could be to characterize the camera motion by the motion parameter
vector $\theta$ defined in Section 2 and to represent its distribution over the video
segment by a probabilistic model. However, while the distribution of the two
translation parameters $a_1$ and $a_4$ could be easily inferred (these two parameters are
likely to be constant within a video segment, so that a Gaussian mixture could
reasonably be used), the task becomes more difficult when dealing with the other
parameters, which may no longer be constant over a segment.

We propose instead to consider another mathematical representation of the
estimated motion models, namely the camera-motion flow vectors, and to consider
the 2D histogram of these vectors. At each time $t$, the motion parameters $\hat{\theta}_t$ of
the camera motion model (1) are available and the vectors $w_{\hat{\theta}_t}(p)$ can be computed
at any point $p$ of the image plane (we consider the points of the image grid
in practice). The values of the horizontal and vertical components of $w_{\hat{\theta}_t}(p)$ are
then finely quantized, and we form the empirical 2D histogram of their distribution
over the considered video segment. Finally, this histogram is represented by
a mixture model of 2D Gaussian distributions. Let us point out that this modeling
can involve several different global motions for events of the same type filmed
in different ways. The number of components of the mixture is determined with
the Integrated Completed Likelihood criterion (ICL, [1]) and the mixture model
parameters are estimated using the Expectation-Maximisation (EM) algorithm.
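The construction of the empirical 2D histogram can be sketched as below. Only the quantization and counting step is shown; the subsequent Gaussian mixture fit (EM, with ICL model selection) is not reproduced. The bin size, the function name and the normalisation to unit mass are assumptions.

```python
import math
from collections import Counter


def flow_histogram(thetas, width, height, bin_size=0.25):
    """Empirical 2D histogram of camera-motion flow vectors over a segment.

    thetas: one affine parameter vector (a1..a6) per pair of successive
            images of the segment.
    Returns a dict {(u_bin, v_bin): relative frequency}.
    """
    hist = Counter()
    for a1, a2, a3, a4, a5, a6 in thetas:
        for y in range(height):
            for x in range(width):
                u = a1 + a2 * x + a3 * y      # horizontal component, model (1)
                v = a4 + a5 * x + a6 * y      # vertical component, model (1)
                hist[(math.floor(u / bin_size), math.floor(v / bin_size))] += 1
    total = sum(hist.values())
    return {b: c / total for b, c in hist.items()}


# A segment of three frame pairs with the same hypothetical horizontal pan.
h = flow_histogram([(2.0, 0.0, 0.0, 0.0, 0.0, 0.0)] * 3, width=4, height=3)
```

A pure translation concentrates all the mass in one bin, while zooms or rotations spread it over many bins, which is why a mixture of 2D Gaussians is a natural compact summary of such a histogram.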



4 Event Detection Algorithm 

We now exploit the designed probabilistic models of motion content for the task 
of event detection in video. We have to learn the concepts of dynamic content 
to be involved in the event detection task. 

We suppose that the videos to be processed are segmented into homogeneous
temporal units. This preliminary step is outside the scope of this paper, which
focuses on the motion modelling, learning and recognition issues. To segment
the video, we can use either a shot change detection technique or a motion-based
temporal video segmentation method. Let $\{s_i\}_{i=1,\dots,N}$ be the partition of
the processed video into homogeneous temporal segments.



4.1 Selecting Video Segments 

The first step of our event detection method sorts the video segments
into two groups: the first group contains the segments likely to contain the relevant
events, the second one is formed by the video segments to be definitively discarded.
Typically, if we consider sport videos, we try to first distinguish between
“play” and “no play” segments. This step is based only on the residual motion
which accounts for the scene motion, therefore only single-variable probabilistic
models are used, which saves computation. To this end, several motion models
are learnt off-line in a training stage for each of the two groups of segments. This
will be detailed in Section 5. We denote by $\{M^{1,n}_{res}, 1 \le n \le N_1\}$ the residual
motion models learnt for the “play” group and by $\{M^{2,n}_{res}, 1 \le n \le N_2\}$ the residual
motion models learnt for the “no play” group. Then, the sorting consists in
assigning the label $\zeta_i$, whose value can be 1 for “play” or 2 for “no play”, to each
segment $s_i$ of the processed video using the ML criterion defined as follows:

$$\zeta_i = \arg\max_{k=1,2} \; \max_{1 \le n \le N_k} P_{M^{k,n}_{res}}(z_i) \qquad (4)$$

where $z_i = \{v_{res_i}, \Delta v_{res_i}\}$ denotes the local residual motion measurements and their
temporal contrasts for the video segment $s_i$.
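The selection rule can be sketched in the log domain as follows. The sketch assumes that the likelihood of a segment is the product of the mixture densities over all its measurements (an independence assumption not spelled out in the text), that every mixture weight lies strictly between 0 and 1, and that each model is stored as a 4-tuple of parameters; all names and numerical values are hypothetical.

```python
import math


def log_density(v, beta, sigma2, truncated):
    """Log of the Dirac / zero-mean Gaussian mixture at one measurement v."""
    if v == 0:
        return math.log(beta)
    gauss = -0.5 * math.log(2 * math.pi * sigma2) - v * v / (2 * sigma2)
    if truncated:                      # Gaussian restricted to v > 0:
        gauss += math.log(2)           # the density is doubled
    return math.log(1 - beta) + gauss


def segment_log_likelihood(v_res, dv_res, model):
    """model = (beta_v, sigma2_v, beta_dv, sigma2_dv)."""
    bv, sv, bd, sd = model
    return (sum(log_density(v, bv, sv, True) for v in v_res)
            + sum(log_density(d, bd, sd, False) for d in dv_res))


def select_group(v_res, dv_res, models_by_group):
    """Group k whose best model maximises the segment likelihood."""
    best = {k: max(segment_log_likelihood(v_res, dv_res, m) for m in models)
            for k, models in models_by_group.items()}
    return max(best, key=best.get)


# Invented models: group 1 ("play") has strong motion, group 2 is calm.
models = {1: [(0.2, 4.0, 0.3, 1.0)], 2: [(0.8, 0.25, 0.8, 0.1)]}
label = select_group([0, 3.0, 2.5, 1.0], [0.5, -1.0, 0, 0.8], models)
```

Working with sums of logarithms avoids the numerical underflow that a direct product over thousands of pixel measurements would cause.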
4.2 Detecting Relevant Events

Problem statement. The second step of the proposed method effectively deals
with the detection of the events of interest within the previously selected segments.
Contrary to the first step, the two kinds of motion information (scene
motion and camera motion) are exploited, since their combination allows us to
characterize a specific event more precisely. For a given genre of video document,
an off-line training stage is required to learn the dynamic content concepts involved
in the event detection task. As explained in Section 5, a residual motion
model $M^j_{res}$ and a camera motion model $M^j_{cam}$ have to be estimated from a
given training set of video samples, for each event $j$ to be retrieved. The detection
is performed in two sub-steps. First, we assign to each pre-selected segment
the label of one of the event classes introduced in the considered task. This issue
is stated as a classification problem, which avoids the need for detection thresholds
and allows all the considered events to be extracted in a single process. Since
false segments might be included in the pre-selected segments, a validation step
is subsequently applied to confirm or reject the assigned labels.



Video segment labeling. We consider only the segments which have been
selected as “play” segments after the first step described above. For each video
segment $s_i$, $z_i = \{v_{res_i}, \Delta v_{res_i}\}$ are the residual motion measurements and their
temporal contrasts, and $w_i$ represents the motion vectors corresponding to the
2D affine motion models estimated between successive images over the video
segment $s_i$.

The video segments are then labeled with one of the $J$ learnt classes of dynamic
events according to the ML criterion. More precisely, the label $l_i$ assigned to the
segment $s_i$ takes its value in the label set $\{1, \dots, J\}$ and is defined as follows:

$$l_i = \arg\max_{j=1,\dots,J} P_{M^j_{res}}(z_i) \times P_{M^j_{cam}}(w_i) \qquad (5)$$

Prior on the classes could be introduced in (5) leading to a MAP criterion.
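For illustration only, such a MAP variant could be written as below. The class priors $\pi_j$ are a notation introduced here, not one used in the paper; they could, for instance, be taken as the relative frequencies of the event classes in the training set.

```latex
l_i = \arg\max_{j=1,\dots,J} \; \pi_j \, P_{M^j_{res}}(z_i) \, P_{M^j_{cam}}(w_i),
\qquad \pi_j \ge 0, \quad \sum_{j=1}^{J} \pi_j = 1 ,
% equivalently, in the log domain:
l_i = \arg\max_{j=1,\dots,J} \left[ \ln \pi_j + \ln P_{M^j_{res}}(z_i) + \ln P_{M^j_{cam}}(w_i) \right].
```

With uniform priors $\pi_j = 1/J$ the additional term is constant and the rule reduces to the ML criterion (5).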



Event label validation. By applying (5), we can label all the segments supplied
by the first selection step. However, we have to consider that there might be “no
play” segments wrongly labeled as “play” after the first selection step. We call
them “intruders”. These segments are forced to be assigned one of the event
classes using relation (5), which creates false detections. As a consequence, we
propose a validation test, involving only residual motion models. It consists in
testing for each segment $s_i$ the hypotheses defined by:

$H_0$: “$s_i$ really belongs to the class $l_i$ determined by (5)”

$H_1$: “$s_i$ is labeled as $l_i$, whereas it is an intruder segment”

To this end, a set of models $\bar{M}^j_{res}$ has to be specified and estimated to represent
the intruder segments. This will be explained in Section 5.

The likelihood test to choose between these two hypotheses is given by:




152 G. Piriou, P. Bouthemy, and J.-F. Yao 



$$\text{if } \frac{P_{M^{l_i}_{res}}(z_i)}{P_{\bar{M}^{l_i}_{res}}(z_i)} < \varepsilon, \; H_1 \text{ is accepted; else, } H_0 \text{ is accepted.}$$



In this way, we can correct some misclassifications resulting from the imperfect 
output of the first selection step, by discarding the video segments which are 
rejected by the likelihood test. 
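A sketch of this validation step, written in the log domain, is given below. It assumes that the log-likelihoods of the segment under the event model and under the corresponding intruder model have already been computed; the function name and the default threshold are hypothetical, since the text does not give the value used.

```python
import math


def validate_label(loglik_event, loglik_intruder, epsilon=1.0):
    """Likelihood ratio test on an assigned event label.

    Returns True when H0 is accepted (the label is kept) and False when
    H1 is accepted (the segment is treated as an intruder and discarded).
    """
    # P_event / P_intruder < epsilon  <=>  log difference < log(epsilon)
    return (loglik_event - loglik_intruder) >= math.log(epsilon)


# Keep only the labelled segments that pass the test (toy values).
labelled = [("seg1", -120.0, -150.0), ("seg2", -200.0, -180.0)]
kept = [name for name, le, li in labelled if validate_label(le, li)]
```

Comparing log-likelihood differences with `log(epsilon)` is equivalent to thresholding the ratio itself, and avoids dividing two very small probabilities.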



5 Learning the Dynamic Content Concepts 

For a given video genre, a training step is performed off-line in order to learn the 
residual motion models and the dominant motion models needed by the event 
detection method. Let us note that we have to divide the training set in two 
sub-sets. The first one is used to learn the motion models required by steps 1 
and 2 of the event detection algorithm, while the second one allows us to learn 
the intruder motion models. 



Learning the residual motion models for the two-group selection step. 

As the first selection step involves the scene motion only, we have to learn residual
motion models as specified in subsection 3.1. Because of the large diversity
of video contents in the two groups “play” and “no play”, we have to represent
each group by several motion models. We apply the ascendant hierarchical classification
(AHC) technique, on the one hand, to the “play” group, and on the other
hand, to the “no play” group of the training set. The overall procedure is defined
as follows.

Step 0: A residual motion model is estimated for each video segment belonging
to the training set of the considered group. At this early stage, each segment
forms a cluster.

Step 1: The two clusters (either composed of one segment or of
several segments) found to be the nearest w.r.t. the symmetrized Kullback-Leibler
distance between their corresponding residual motion models are merged into the
same cluster. The expression of this distance between two residual motion models
$M^1_{res}$ and $M^2_{res}$ is

$$d(M^1_{res}, M^2_{res}) = \tfrac{1}{2}\left(d_K(M^1_{res}, M^2_{res}) + d_K(M^2_{res}, M^1_{res})\right),$$

where $d_K(M^1_{res}, M^2_{res}) = d_K(f^1_{v_{res}}, f^2_{v_{res}}) + d_K(f^1_{\Delta v_{res}}, f^2_{\Delta v_{res}})$. The expression
of the Kullback-Leibler distance between the density functions $f^1_{v_{res}}$ with parameters
$(\beta_1, \sigma_1)$ and $f^2_{v_{res}}$ with parameters $(\beta_2, \sigma_2)$ of the residual motion
measurements is given by:

$$d_K(f^1_{v_{res}}, f^2_{v_{res}}) = \beta_1 \ln\frac{\beta_1}{\beta_2} + (1-\beta_1)\ln\frac{\sigma_2(1-\beta_1)}{\sigma_1(1-\beta_2)} + \frac{1-\beta_1}{2}\left(\frac{\sigma_1^2}{\sigma_2^2} - 1\right).$$

The Kullback-Leibler distance between the density functions $f^1_{\Delta v_{res}}$ and $f^2_{\Delta v_{res}}$
of the temporal contrasts can be similarly written. A residual motion model
is then estimated for the obtained new cluster. We iterate until the stopping
criterion is satisfied.

Stopping criterion: We stop if the maximum of the symmetrized
Kullback-Leibler distances between two clusters is lower than a certain
percentage of the maximum of the distances computed at step 0.
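The procedure above can be sketched as follows. For brevity the sketch clusters on the $v_{res}$ part of the model only (the temporal-contrast term would be added to the distance in the same way), represents each cluster by pooled sufficient statistics so that re-estimating the model after a merge is a simple sum, and uses an arbitrary stopping fraction; these choices and all names are assumptions. Mixture weights are assumed to lie strictly between 0 and 1.

```python
import math


def kl(b1, s1, b2, s2):
    """Kullback-Leibler distance between two Dirac/Gaussian mixtures."""
    return (b1 * math.log(b1 / b2)
            + (1 - b1) * math.log(s2 * (1 - b1) / (s1 * (1 - b2)))
            + (1 - b1) / 2 * (s1 ** 2 / s2 ** 2 - 1))


def sym_kl(m1, m2):
    """Symmetrized distance between two models m = (beta, sigma)."""
    return 0.5 * (kl(*m1, *m2) + kl(*m2, *m1))


def model_of(stats):
    """ML parameters (beta, sigma) from pooled (n, n_zero, sum_of_squares)."""
    n, n_zero, sum_sq = stats
    return n_zero / n, math.sqrt(sum_sq / (n - n_zero))


def pair_distances(clusters):
    return [(sym_kl(model_of(clusters[i]), model_of(clusters[j])), i, j)
            for i in range(len(clusters)) for j in range(i + 1, len(clusters))]


def ahc(segment_stats, fraction=0.2):
    """Merge nearest clusters until the largest distance is small enough."""
    clusters = [tuple(s) for s in segment_stats]       # step 0
    if len(clusters) < 2:
        return clusters
    d_max0 = max(pair_distances(clusters))[0]
    while len(clusters) > 1:
        pairs = pair_distances(clusters)
        if max(pairs)[0] < fraction * d_max0:           # stopping criterion
            break
        _, i, j = min(pairs)                            # nearest pair (step 1)
        merged = tuple(a + b for a, b in zip(clusters[i], clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)                         # model re-estimated lazily
    return clusters


# Toy statistics (n, n_zero, sum of squares), invented for illustration.
stats = [(100, 80, 5.0), (100, 78, 6.0), (100, 20, 320.0), (100, 22, 300.0)]
result = ahc(stats, fraction=0.2)
```

Storing counts and sums of squares rather than raw measurements makes each merge exact with respect to the ML estimates, since those estimates depend on the data only through these statistics.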
At this stage, the load of manually labelling the video segments of the training
set is kept low. Indeed, we just need to sort the video segments into the two
groups “play” and “no play”. At the end, each group is represented by a (small)
set of clusters (depending on the heterogeneity of the video segment contents
of the group) and their associated residual motion models, both obtained in an
automatic way.



Learning the motion models of the event classes. Camera motion mod- 
els and residual motion models representing the different event classes to be 
recognized are required for the second step of our detection method. They are 
estimated from the same training set as the one used to learn residual motion 
models involved in the selection step. We first need a manual labelling of the 
“play” segments of the training set according to the events to detect. For each 
event class, a camera motion model is estimated from the video segments repre- 
senting the considered event as explained at the end of subsection 3.2. Similarly, 
the four parameters of the residual motion models for each event class are esti- 
mated using the ML criterion. 



Learning of intruder motion models. We also have to determine motion
models, from the second subset of the training set, to represent the intruder
segments. It is important to consider a different set of video segments than the
one used to learn the models involved in the first steps of the detection method.
The selection step is applied to the second subset of the training set. The intruder
segments are then determined (since we have the ground truth on that training
set) and submitted to the classification step of the method. Finally, we get a
subset of intruder segments associated with each predefined event $j$, which allows
us to estimate the associated residual motion model previously denoted by $\bar{M}^j_{res}$.

6 Experimental Results 

We have applied the described method on sports videos which involve complex 
contents while being easily specified. Moreover, events or highlights can be nat- 
urally related to motion information in that context. We report here results 
obtained on athletics and tennis videos. 



6.1 Experimental Comparison 

First, we have carried out an experimental comparison between our statistical 
approach and a histogram-based technique. In order to evaluate the probabilistic 
framework we have designed, we consider the same motion measurements for the 
histogram technique. Thus, the latter involves three histograms: the histogram of 
residual motion measurements $v_{res}$ (2), the histogram of their temporal contrasts
$\Delta v_{res}$, and the 2D histogram of the camera-motion flow vectors (subsection 3.2).
Each event $j$ is then represented by three histograms: $H^j_{v_{res}}$, $H^j_{\Delta v_{res}}$ and $H^j_{cam}$.

Fig. 2. Athletics video: 2D histograms of the camera-motion flow vectors. Left: for a
pole vault shot, right: for a long-shot of track race.



Fig. 3. Athletics video: Detection of relevant events: Top row: ground-truth, middle
row: results obtained with the probabilistic motion models, bottom row: results
obtained with the histogram-based technique. From dark to light shading: pole vault,
replay of pole vault, long-shot of track race and close-up of track race.



To compare two histograms, we consider the Euclidean distance, denoted by $d_1$
for 1D histograms and by $d_2$ for 2D histograms. Several distances can be considered
to compare two histograms, and this issue has to be carefully addressed.
However, the computed motion measurements are all real values and we have
a huge number of available computed values. We can thus consider a very fine
quantization and the resulting histograms are very close to the real continuous
distributions. Moreover, the histogram distance is only used to rank the classes.
The Euclidean distance is then a reasonable choice while being easy to compute. These
histograms are computed (and stored) for each event $j$ from the training set of
video samples. Then, we consider the test set and we compute the three histograms
$H^{s_i}_{v_{res}}$, $H^{s_i}_{\Delta v_{res}}$ and $H^{s_i}_{cam}$ for each video segment $s_i$ of the test set. The
classification step is now formulated as assigning the label $l_i$ of the event which
minimizes the sum of the distances between histograms:

$$l_i = \arg\min_{j=1,\dots,J} \left( d_1(H^{s_i}_{v_{res}}, H^j_{v_{res}}) + d_1(H^{s_i}_{\Delta v_{res}}, H^j_{\Delta v_{res}}) + d_2(H^{s_i}_{cam}, H^j_{cam}) \right) \qquad (6)$$
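The histogram-based baseline can be sketched as below. Histograms are stored as dictionaries from bin index to mass, so the same distance function serves for 1D bins (integers) and 2D bins (pairs); this representation, the function names and the toy histograms are assumptions made for illustration.

```python
import math


def euclidean(h1, h2):
    """Euclidean distance between two histograms stored as {bin: mass}."""
    bins = set(h1) | set(h2)
    return math.sqrt(sum((h1.get(b, 0.0) - h2.get(b, 0.0)) ** 2 for b in bins))


def classify(segment_hists, class_hists):
    """Criterion (6): class minimising the sum of the three distances.

    segment_hists: (H_v, H_dv, H_cam) of the segment to label.
    class_hists:   {event: (H_v, H_dv, H_cam)} learnt from the training set.
    """
    def total(event):
        return sum(euclidean(hs, hj)
                   for hs, hj in zip(segment_hists, class_hists[event]))
    return min(class_hists, key=total)


# Invented histograms for two of the event classes named in the text.
classes = {
    "pole vault": ({0: 0.5, 1: 0.5}, {0: 1.0}, {(0, 0): 1.0}),
    "track race": ({0: 0.1, 3: 0.9}, {0: 0.2, 1: 0.8}, {(8, 0): 1.0}),
}
segment = ({0: 0.4, 1: 0.6}, {0: 0.9, 1: 0.1}, {(0, 0): 0.9, (1, 0): 0.1})
label = classify(segment, classes)
```

Unlike the probabilistic models, this baseline has to store one finely quantized histogram triple per event class rather than a handful of parameters.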

In order to focus on the classification performance of the two methods, the test 
set only involves “play” segments. We have processed a part of an athletics TV 
program which includes jump events and track race shots. The training set is 
formed by 12500 images and the test set comprises 7800 images. Some representative
images of this video are displayed in Figure 1. We want to recognize four
events: Pole vault, Replay of pole vault, Long-shot of track race and Close-up
of track race. Consequently, we have to learn four residual motion models and
four camera motion models for the method based on the probabilistic motion
modelling. Figure 2 contains the 2D histograms of the camera-motion flow vectors
for two classes. In Figure 3, the processed video is represented by a time
line exhibiting the duration of the video segments. The “no play” segments have 
been in fact withdrawn, and the “play” segments have been concatenated to 
form the time line. A grey level is associated to each event class. The first row 
corresponds to the ground truth, the second one and the third one contain the 
results obtained respectively using the probabilistic motion models and using the 
histogram technique. These results demonstrate that the statistical framework 
yields quite satisfactory results and outperforms the histogram-based technique. 



6.2 Event Detection Method 

We have applied our event detection method to a tennis TV program. The first 
42 minutes (63000 images) of the video are used as the training set (22 minutes 
for the learning of the motion models involved in the first two steps and 20
minutes for the learning of intruder models), and the last 15 minutes (18000 
images) form the test set. 



Selecting video segments. We want to distinguish between “play” segments 
involving the two tennis players in action and the “no play” segments including 
views of the audience, referee shots or shots of the players resting, as illustrated 
in Figure 4. We only exploit the first subset of the training set to learn the 
residual motion models that we need for the selection step. 205 video segments 
of the training set represent “play” segments and 95 are “no play” segments. 31 
residual motion clusters and their associated models are supplied by the AHC 
algorithm for the “play” group, and 9 for the “no play” group. The high number 
of clusters obtained reveals the diversity of dynamic contents in the two groups 
of the processed video. Quite satisfactory results are obtained, since the precision 
rate for the play group is 0.88 and the recall rate is 0.89. 




Fig. 4. Tennis video: Three image samples extracted from the group of “play” segments 
and three image samples extracted from the group of “no play” segments. 

Table 1. Tennis video: Results of the event detection method based on probabilistic 
motion models (P: precision, R: recall). 





                  Rally    Serve    Change of side
P                 0.92     0.63     0.85
R                 0.89     0.77     0.74

Detecting relevant events. The goal is now to detect the relevant events
of the tennis video among the segments selected as “play” segments. For this 
second step, we introduce the probabilistic camera motion model. The three 
events we try to detect are the following: Rally, Serve and Change of side. In 
practice, we consider two sub-classes for the Serve class, which are wide-shot of 
serve and close-up of serve. Two sub-classes are also considered for the
Change-of-side class. As a consequence, five residual motion models and five
camera motion models have to be learnt. We also have to determine the residual
motion models accounting for the intruder segments of each class. The results
are reported in Table 1. Satisfactory results are obtained, especially for the
rally class. The precision of the serve class is lower than the others. In fact, for the
serve class, errors come from the selection step (i.e., some serve segments are 
wrongly put in the “no play” group, and then, are lost). It appears that a few 
serve segments are difficult to distinguish from some “no play” segments when 
using only motion information. However, the proposed statistical framework can 
easily integrate other information such as color or audio. 

7 Conclusion 

We have addressed the issue of determining dynamic content concepts from low- 
level video features with the view to detecting meaningful events in video. We 
have focused on motion information and designed an original and efficient statis- 
tical method. We have introduced new probabilistic motion models representing 
the scene motion and the camera motion. They can be easily computed from 
the image sequence and can handle a large variety of dynamic video contents. 
We have demonstrated that the considered statistical framework outperforms a 
histogram-based technique. Moreover, it is flexible enough to properly introduce 
a prior on the classes if available, or to incorporate other kinds of video primitives
(such as color or audio). The proposed two-step method for event detection is 
general and does not exploit very specific knowledge (e.g. related to the type of 
sport) and dedicated solutions. Satisfactory results on sports videos have been 
reported.

Abstract. Tensor voting is an efficient algorithm for perceptual grouping
and feature extraction, particularly for contour extraction. In this paper
two studies on tensor voting are presented. First, the use of iterations is
investigated, and second, a new method for integrating curvature information
is evaluated. In contrast to other grouping methods, tensor voting claims the
advantage of being non-iterative. Although non-iterative tensor voting
methods provide good results in many cases, the algorithm can be iterated to
deal with more complex data configurations. The experiments conducted
demonstrate that iterations substantially improve the process of feature
extraction and help to overcome limitations of the original algorithm. As a
further contribution we propose a curvature improvement for tensor voting.
In contrast to the curvature-augmented tensor voting proposed by Tang and
Medioni, our method takes advantage of the curvature calculation already
performed by the classical tensor voting and evaluates the full curvature,
sign and amplitude. Some new curvature-modified voting fields are also
proposed. Results show a lower degree of artifacts, smoother curves, a high
tolerance to scale parameter changes and also more noise-robustness.



1 Introduction 

Medioni and coworkers developed tensor voting as an efficient method for
contour extraction and grouping. The method, supported by Gestalt psychology,
is based on a tensor representation of image features and non-linear voting, as
described in [2]. See also [9] for a comparison with other existing methods.
Tensor voting is a non-iterative procedure, in the sense that the original
scheme implements only 2 steps of voting, claiming that no more iterations are
needed. In contrast, other methods for perceptual grouping [4,3,1] refine the
results by iterative feedforward-feedback loops. Therefore, the aim of this
study is to investigate how an increased number of iterations can improve the
results of tensor voting. Some basic examples are analyzed and an extraction
quality measurement is proposed. The latter allows a statistical study to be
performed on the influence of iterations in a simple case.

A curvature improvement has been proposed by Tang and Medioni [7]. They
compute the sign of curvature and use it for modifying the voting fields. We
propose a more sophisticated calculation of the curvature information with a
low computational cost. Instead of the sign of curvature, the proposed method
evaluates the full curvature using part of the calculations previously performed
by the tensor voting. We adopt a curvature compatibility approach that was
described by Parent and Zucker [6]. A statistical evaluation is presented and the
methods are finally tested with more complex data in the presence of noise.

Section 2 briefly introduces the tensor voting method. Section 3 presents a 
study on the usefulness of iterations for tensor voting, and Section 4 describes
some improvements that can be achieved when both curvature information and
iterations are used. Some concluding remarks are drawn in Section 5.

2 A Brief Introduction to Tensor Voting 

The classical algorithm is not described in full detail here; only a brief
description is presented in order to stress the new contributions of this paper.
For a more in-depth study the reader can refer to [2,7]. Note also that the
present work is restricted to still 2D images, but it could be extended to
N-dimensional features, like volumetric data or motion [8,5].

A local description of the curves at each point of the image can be encoded by
a symmetric positive 2x2 tensor. Tensors can be diagonalized; their eigenvalues
are denoted $\lambda_1$, $\lambda_2$ with $\lambda_1 \geq \lambda_2 \geq 0$, and
the corresponding eigenvectors are denoted by $e_1$, $e_2$. Tensors can be
decomposed as follows:

$$T = (\lambda_1 - \lambda_2)\, e_1 e_1^T + \lambda_2 I \qquad (1)$$

where $I$ is the identity matrix. The first term is called the stick component,
where $e_1$ is an evaluation of the tangent to the curve. The stick saliency
$\lambda_1 - \lambda_2$ gives a confidence measure for the presence of a curve.
The second term is called the ball component, and its saliency $\lambda_2$ gives
a confidence measure for the presence of a junction.
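
As an illustration of the decomposition of Eq. 1, the following Python sketch
computes the stick saliency, the ball saliency and the tangent estimate of a
symmetric 2x2 tensor in closed form, and rebuilds the tensor from them. The
function names and the tuple layout (t11, t12, t22) are assumptions made for
this example; they are not taken from the original implementation.

```python
import math

def decompose_tensor(t11, t12, t22):
    """Split the symmetric tensor [[t11, t12], [t12, t22]] into
    (stick saliency l1 - l2, ball saliency l2, unit tangent e1)."""
    mean = 0.5 * (t11 + t22)
    # Half the gap between the two eigenvalues of a symmetric 2x2 matrix.
    half_gap = math.hypot(0.5 * (t11 - t22), t12)
    l1, l2 = mean + half_gap, mean - half_gap
    # Orientation of the eigenvector associated with the largest eigenvalue.
    angle = 0.5 * math.atan2(2.0 * t12, t11 - t22)
    e1 = (math.cos(angle), math.sin(angle))
    return l1 - l2, l2, e1

def recompose_tensor(stick, ball, e1):
    """Rebuild T = stick * e1 e1^T + ball * I (Eq. 1) as (t11, t12, t22)."""
    ex, ey = e1
    return (stick * ex * ex + ball, stick * ex * ey, stick * ey * ey + ball)
```

For instance, `decompose_tensor(1.0, 0.0, 0.0)` describes a pure stick tensor
aligned with the first axis: its stick saliency is 1 and its ball saliency is 0.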

The classical tensor voting algorithm performs two voting steps in which
each tensor propagates to its neighborhood. Stick tensors propagate mostly in
the direction of $e_1$. The region of propagation is defined by the stick voting
field, which decays as a function of the distance and curvature (see Eq. 3 and
Fig. 2.h). Ball tensors propagate in all directions and decay with the distance.
After all tensors are propagated, all contributions are summed up to define new
tensors that will be used for the next step. That summation can be considered as
an averaging or a “voting”. The first voting step is referred to as the “sparse
vote” because the vote is performed only on points where tensors are not null.
The last voting step is called the “dense vote” because the vote is accomplished
on every point. After all the voting steps are completed, curves are extracted as
local maxima of stick saliency along the normal direction to stick components.
Note that thresholds are necessary to eliminate low-saliency local maxima. These
thresholds are held constant for each of the following experiments. Fig. 1
summarizes the different steps of the algorithm, showing in boldface characters
the new contributions proposed: an iterative sparse voting mechanism and a
curvature calculation for modifying the voting fields.

Fig. 1. Classical tensor voting consists of four steps: (1) tensor initialization,
(2) sparse voting, (3) dense voting, and (4) feature extraction. The new
contributions are depicted with boldface characters, which describe iterations of
the sparse voting process and curvature calculations during the sparse vote stage,
modifying the voting fields by incorporating the calculated curvature.



3 Iterated Tensor Voting 

3.1 Example 

Tensor voting is a very efficient technique for grouping data points that are
separated by almost the same distance. A free parameter $\sigma$ (the scale
factor, see Eq. 3) has to be adjusted to the distance between points. If
$\sigma$ is misadjusted, performance strongly decreases: if $\sigma$ is too small
points will not be grouped; if $\sigma$ is too big the grouping is less selective.

Fig. 2.a shows a simple example with two sets of points: first, a three by
three array of points separated by 11 pixels vertically and 13 pixels horizontally.
Because the vertical distance is smaller, following Gestalt psychology rules, these
points have to be grouped vertically. Second, a set of three points aligned
diagonally and separated by 42 pixel gaps. Because the gaps are different in the
two sets of points, it is not possible to adjust $\sigma$ for extracting both
structures correctly. As shown in Fig. 2.c, if $\sigma$ is small, i.e. around 5,
only the vertical array is grouped. If $\sigma$ is bigger than 15, only the
diagonal line is correctly grouped (Fig. 2.i). Between these values (from
$\sigma = 7$ to 13, Fig. 2.e and g) neither of these sets of points is accurately
grouped.




Fig. 2. Example showing the tensor voting results for different values of the scale
factor $\sigma$ and of the number of iterations. a. Data points belong to two
sets: three points aligned diagonally and an array which has to be grouped in
vertical lines. b. Contours of the voting field for $\sigma = 5$ are drawn at 50%
(solid line) and 5% (dash-dot line) of the maximum value (see Eq. (3) for the
voting field description). c. Extraction results with the classical tensor voting
algorithm (two voting steps) and $\sigma = 5$: the array structure is accurately
extracted, but the scale factor is too small to group the diagonal points. d. and
f. Voting fields with $\sigma = 9$ and $\sigma = 12$, respectively. e. and g.
Contours extracted with $\sigma = 9$ and $\sigma = 12$, respectively, by the
non-iterative tensor voting. In both cases the algorithm fails to find both the
array and the diagonal point structures. The scale factor $\sigma$ is too big for
the array and too small for the diagonal points. h. and j. Voting field with
$\sigma = 15$. i. With non-iterative tensor voting and $\sigma = 15$, the diagonal
points are correctly grouped, but not the array points. k. With $\sigma = 15$ and
20 iterations the structure is accurately extracted: both the array and the
diagonal line are correctly grouped.



Iterations are implemented on the sparse voting stage. For n iterations, n-1
sparse votes and one dense vote are required, as shown in Fig. 1. An increased
number of iterations can refine the results until the correct structure is
extracted. Fig. 2.k shows the results with $\sigma = 15$ and 20 iterations. Both
the array and the diagonal line structures are now simultaneously extracted,
which the non-iterative algorithm was not able to do. Note that a normalization
stage is applied after each iteration to keep the sum of tensor eigenvalues
constant.
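
The iteration schedule described above can be sketched as follows. This is a
minimal outline under stated assumptions: the sparse and dense votes are
supplied as functions (the hypothetical arguments `sparse_vote` and
`dense_vote`), tensors are stored as tuples (t11, t12, t22), and the
normalization is applied to each tensor separately. The text only states that
the sum of the eigenvalues is kept constant, so the per-tensor choice is a guess.

```python
def normalize(tensor, target_sum):
    """Rescale a symmetric 2x2 tensor (t11, t12, t22) so that the sum of
    its eigenvalues, i.e. its trace, equals target_sum."""
    t11, t12, t22 = tensor
    trace = t11 + t22
    if trace == 0.0:
        return tensor  # a null tensor cannot be rescaled
    k = target_sum / trace
    return (k * t11, k * t12, k * t22)

def iterated_voting(tensors, sparse_vote, dense_vote, n):
    """Run n voting steps: n - 1 sparse votes followed by one dense vote."""
    sums = [t[0] + t[2] for t in tensors]  # eigenvalue sums to preserve
    for _ in range(n - 1):
        tensors = sparse_vote(tensors)
        # Assumed normalization stage, applied after every sparse vote.
        tensors = [normalize(t, s) for t, s in zip(tensors, sums)]
    return dense_vote(tensors)
```

With `n = 2` the loop performs one sparse vote and one dense vote, which matches
the two voting steps of the classical algorithm.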

Fig. 3. Systematic evaluation of the influence of iterations using a three by three
array of points. a. Original array of points separated by $\Delta x$, $\Delta y$
(parameters are the same in all insets: $\Delta x = 4$, $\Delta y = 9$ and
$\sigma = 10$). b. Contour extraction with the non-iterative tensor voting fails:
central points are not grouped, lateral points are grouped but not in strictly
vertical lines, and there are some artifacts (Q=0.40). c. The structure is well
extracted after 10 iterations of voting: points are grouped in vertical lines
(Q=2.38).








3.2 Statistics on the Influence of Iterations 

A 3x3 array of points, shown in Fig. 3.a, is used to evaluate the effect of
iterations on tensor voting. Vertical and horizontal distances between points are
denoted $\Delta x$ and $\Delta y$ respectively. In the following, $\Delta x$ will
be chosen smaller than $\Delta y$. In such a case points have to be grouped
vertically (on the contrary, if $\Delta x > \Delta y$ points would have to be
grouped horizontally). Taking into account that points have to be grouped
vertically, a measure of how good the tensor orientation is can be represented by:



$$Q = -\log_{10}\left( \sum_{i=1,\dots,9} \left( 1 - \frac{T_i(1,1)}{S_i} \right) \right) \qquad (2)$$

where $i$ indexes the 9 points of the array, $T_i$ is the tensor of the point $i$,
$S_i$ is the sum of eigenvalues of $T_i$ and $T_i(1,1)$ the vertical component of
the tensor $T_i$.

As vertical lines have to be extracted, tensors are correctly oriented if they
have a form close to $T_i = S_i \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$. In
such a case $\sum_i \left( 1 - \frac{T_i(1,1)}{S_i} \right)$ is close to zero,
providing a high value for Q. Thus, Q can be considered as an extraction quality
measurement for the described experiment. When Q < 1 tensors are misoriented and
extraction can be considered as failed. On the contrary, Q > 2 indicates tensors
are well oriented and the structure is correctly extracted.
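
A short Python sketch of the quality measure Q of Eq. 2 is given below. It
assumes that each tensor is stored as a tuple (t11, t12, t22), with t11 the
vertical component of the tensor, and it uses the fact that the sum of the
eigenvalues of a tensor equals its trace. The function name is hypothetical.

```python
import math

def extraction_quality(tensors):
    """Q = -log10( sum_i (1 - T_i(1,1) / S_i) ) for tensors (t11, t12, t22)."""
    # S_i, the sum of the eigenvalues, is the trace t11 + t22.
    misalignment = sum(1.0 - t11 / (t11 + t22) for t11, _, t22 in tensors)
    if misalignment <= 0.0:
        return math.inf  # every tensor is perfectly vertical
    return -math.log10(misalignment)
```

With nine tensors whose vertical component carries 99.9% of the trace, the sum
is 0.009 and Q is slightly above 2, the level taken above as a correct
extraction; nine isotropic tensors give a sum of 4.5 and a negative Q.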



3.3 Results 

Fig. 4 presents results for different parameters $\Delta x$, $\Delta y$ and n
(number of iterations). For all cases the scale factor $\sigma$ is fixed to 10.
Again, please note that in this study we are only considering cases where
$\Delta x < \Delta y$ (which should yield vertical grouping following the Gestalt
rules of proximity and good continuation).

Fig. 4. Extraction quality as a function of the array parameters $\Delta x$ and
$\Delta y$ for the grid example of Fig. 3. The number of iterations n is indicated
by different gray shades in the bars (the two-iteration bar corresponds to the
classical algorithm with two voting steps). $\sigma = 10$ is held constant for the
entire experiment. Only cases with $\Delta x < \Delta y$ are shown here. a. With a
fixed $\Delta x = 4$ and $5 \leq \Delta y \leq 13$. If $\Delta y \ll \sigma$,
$\sigma$ is too large in comparison to the features and the extraction fails even
if more iterations are deployed. If $9 \leq \Delta y \leq 11$ the structure is
extracted using several iterations (results go from failed (Q < 1) when using the
non-iterative algorithm up to accurate (Q > 2) when more iterations are deployed).
Only if $\Delta y \geq 12$ is the non-iterative algorithm able to extract the
desired information. b. $\Delta x = 8$ and $9 \leq \Delta y \leq 15$. c.
$\Delta x = 13$ and $14 \leq \Delta y \leq 19$. In difficult cases, such as when
$\Delta x \approx \Delta y$ or $\Delta y \approx \sigma$, several iterations are
needed for extracting the structure accurately. d. $\Delta x = 25$ and
$26 \leq \Delta y \leq 31$. Although $\sigma$ is too small in comparison to the
features, an accurate extraction is obtained due to the infinite Gaussian
extension of the propagation fields.



Extraction is accurate for any number of iterations if $\sigma$ corresponds to the
optimal scale for the investigated stimuli and if there is no competition be- 
tween vertical and horizontal grouping, that is, if a Ay and Ax Ay (see 
Fig. 4.a,b,c,d at their rightmost parts).

If $\Delta y \ll \sigma$ it is impossible to extract the structure even if more
iterations are deployed (see Fig. 4.a, left part); the scale factor is indeed too
large to be selective enough.

If $\Delta y \approx \sigma$ the application of the classical algorithm fails to
extract the curves. On the contrary, iterations allow tensor voting to obtain the
correct structure, as can be observed in Fig. 4.a (center part) and Fig. 4.b (left
part). A similar situation is observed if $\Delta x \approx \Delta y$ and
$\Delta x$, $\Delta y$ are not much bigger than $\sigma$. Iterated tensor voting
allows the structure to be extracted where the classical algorithm fails (see
Fig. 4.c, left part).

In conclusion, the non-iterative algorithm suffices for correctly extracting
image features only if the features to be extracted are simple and do not
appear in competition. For more complicated cases, when some competition
between orientations is present or when the scale factor $\sigma$ is not
precisely adjusted, more than two iterations are required. Moreover, it has been
seen that in almost all cases iterations do not impair the quality of the results;
on the contrary, they allow the final structures to be refined. In all, the use
of iterations can help to overcome the limitations of the non-iterative method,
improving the feature extraction results.



4 Curvature Improvement 

4.1 Method 



The proposed curvature improvement introduces a curvature calculation and
modified stick voting fields. The curvature is evaluated in each voter point by
averaging over all receiver points the curvature calculation $\rho$ already
computed in the classical tensor voting. In the classical tensor voting, a voter
A votes on a receiver B with an amplitude described by the stick voting field
equation:

$$V(A,B) = \exp\left( -\frac{s(A,B)^2 + c\,\rho(A,B)^2}{\sigma^2} \right) \qquad (3)$$

with

$$\rho(A,B) = \frac{2 \sin\theta}{d} \qquad (4)$$

where $s(A,B)$ and $\rho(A,B)$ are respectively the length and the curvature of
the circular arc which is tangent to $e_1(A)$ in point A and goes through point B
(see Fig. 5.a), $d$ is the Euclidean distance between the points A and B, and
$\theta$ is the angle between the vectors $e_1(A)$ and $\vec{AB}$. $\sigma$ (the
scale factor) and $c$ are constants. Fig. 2.b,d,f,h shows the contours of such
voting fields for different values of $\sigma$.
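
The voting amplitude of Eqs. 3 and 4 can be evaluated with a few lines of
Python. The sketch below is an illustration under stated assumptions: the arc
length is obtained from the usual circle geometry, s = theta * d / sin(theta)
(and s = d when theta = 0), the angle is folded into [0, pi/2] because the
tangent has no direction, and the constant c, whose value is not given in the
text, is left as a parameter. The function name is hypothetical.

```python
import math

def stick_vote(a, e1, b, sigma, c):
    """Amplitude V(A, B) of the vote cast by voter a (unit tangent e1) on b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    d = math.hypot(dx, dy)
    if d == 0.0:
        return 1.0  # limit of Eq. 3 when the receiver coincides with the voter
    # Angle between the undirected tangent and AB, folded into [0, pi/2].
    dot = abs(e1[0] * dx + e1[1] * dy)
    cross = abs(e1[0] * dy - e1[1] * dx)
    theta = math.atan2(cross, dot)
    rho = 2.0 * math.sin(theta) / d                          # Eq. 4
    # Length of the circular arc tangent to e1 at a and passing through b.
    s = d if theta == 0.0 else theta * d / math.sin(theta)
    return math.exp(-(s * s + c * rho * rho) / (sigma * sigma))  # Eq. 3
```

A receiver placed on the tangent line at distance sigma receives a vote of
exp(-1); a receiver at the same distance but off the tangent line receives less,
because both the arc length and the curvature increase.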

The curvature will be evaluated in each voter point A. To permit inflexion
points and changes of curvature, the curvature is calculated separately in both
half planes $P_+$ and $P_-$ defined respectively by
$P_+ = \{B, (e_1(A), \vec{AB}) > 0\}$ and $P_- = \{B, (e_1(A), \vec{AB}) < 0\}$.
The weighted average over each half plane gives $\gamma_i(A)$ (where $i = +$ or
$-$), which is a curvature evaluation at the point A:

$$\gamma_i(A) = \frac{\sum_{B \in P_i} (\lambda_1(B) - \lambda_2(B))\, V(A,B)\, \rho(A,B)}{\sum_{B \in P_i} (\lambda_1(B) - \lambda_2(B))\, V(A,B)} \qquad (5)$$

where $\lambda_1(B)$, $\lambda_2(B)$ are the eigenvalues of the tensor B. The
weighted average is very similar to the “voting” used in tensor voting: the same
weighting functions composed by the voting fields V and the stick saliency
$\lambda_1 - \lambda_2$ are used.
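
The weighted average of Eq. 5 can be sketched in Python as follows. This is a
hedged illustration, not the authors' code: receivers are assumed to be given as
tuples (x, y, stick saliency), the signed curvature is computed as
2 * cross / d^2 (which equals 2 sin(theta) / d), the arc length uses the circle
geometry s = theta * d / sin(theta), and a half plane without receivers is given
a curvature of zero. All names are hypothetical.

```python
import math

def curvature_estimates(a, e1, receivers, sigma, c):
    """Return (gamma_plus, gamma_minus) at voter a with unit tangent e1.

    receivers is a list of (x, y, saliency) tuples, where saliency stands
    for the stick saliency lambda1(B) - lambda2(B) of the receiver.
    """
    num = {+1: 0.0, -1: 0.0}
    den = {+1: 0.0, -1: 0.0}
    for bx, by, saliency in receivers:
        dx, dy = bx - a[0], by - a[1]
        d = math.hypot(dx, dy)
        dot = e1[0] * dx + e1[1] * dy
        if d == 0.0 or dot == 0.0:
            continue  # B is A itself or lies on the normal: no half plane
        cross = e1[0] * dy - e1[1] * dx
        rho = 2.0 * cross / (d * d)               # signed curvature, Eq. 4
        theta = math.atan2(abs(cross), abs(dot))  # folded into [0, pi/2]
        s = d if cross == 0.0 else theta * d / math.sin(theta)
        weight = saliency * math.exp(-(s * s + c * rho * rho) / sigma ** 2)
        half = +1 if dot > 0.0 else -1            # P+ or P-
        num[half] += weight * rho
        den[half] += weight
    gamma = {h: (num[h] / den[h] if den[h] > 0.0 else 0.0) for h in (+1, -1)}
    return gamma[+1], gamma[-1]
```

When all receivers lie on one circle of radius 20 that is tangent to e1 at the
voter, both estimates equal 1/20 = 0.05; mirroring the receivers of the rear
half plane gives opposite signs, which corresponds to an inflexion point.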

The $\gamma_i$ determined at one iteration can then be used in the next iteration
for modifying the stick voting fields. The following equation extends Eq. 3:




Fig. 5. a. Tensor voting fields are built by calculating the distance $d$, the
angle $\theta$, the arc length $s$ and the curvature $\rho$ between the voter A,
oriented by its first eigenvector $e_1$, and the receiver B. In the curvature
improvement the curvature is evaluated in the voter A by averaging $\rho$ over
all receivers. b. Classical voting field without curvature. Contours are drawn at
50% and 5% of the maximum value; $\sigma = 15$ for all voting fields of the
figure. c. Symmetric curved voting field with curvatures
$\gamma_+ = \gamma_- = .06$. d. Curved voting field with different curvature in
both half planes, $\gamma_+ = .09$ and $\gamma_- = .03$. e. Curved voting field
with inflexion, $\gamma_+ = .06$ and $\gamma_- = -.06$.



Some examples of such curvature-modified voting fields are shown in
Fig. 5.c,d,e, to be compared with the former contours in Fig. 5.b. In points where
the ball component has a significant level in comparison to the stick component,
curvatures have to be considered as zero because no reliable curvature
calculation is possible if the curve orientation is itself not reliable. Therefore
curved voting fields are employed only where the tensor orientation has high
confidence (the curved
voting fields are only used under the condition $\lambda_1 / \lambda_2 > 10$).

Remarkably, the method follows the “voting” methodology. Curvature is
found by averaging. Moreover, it uses the same voting fields V as tensor voting.
It can then be hoped to conserve the good properties of the tensor voting,
like the robustness to noise. The curvature improvement does not entail an
important additional computational cost in comparison to the classical method,
since it uses the same kind of operations as the tensor voting and reuses
calculations already done, i.e. in the curvature calculation of Eq. 5 all
variables $\lambda_1$, $\lambda_2$, V and $\rho$ are already computed by the
classical tensor voting.

Note also that an increased number of iterations is necessary to refine the
results. The number of iterations can be considered as an additional parameter
of the algorithm. A procedure could also be implemented for stopping the
iterations when the results do not change much from one iteration to the next.
For all examples presented here a fixed number of iterations is used. 10
iterations have been seen to be sufficient unless the data structure presents
some special ambiguity.

In the following, the curvature improvement will be compared with the
non-iterative tensor voting and iterative tensor voting without curvature
improvement. Results remain to be compared with Tang and Medioni’s method,
which takes into account the sign of curvature [7]; this was outside the scope
of the present study.

4.2 Statistical Study 

Fig. 6.a shows an image composed of sparse points located on the edges of an
ellipse. The distance between points varies between 6 and 12 pixels. This example
is used for comparing the three versions of the algorithm. For different values of
the scale factor $\sigma$, we count the number of points erroneously extracted
outside the ellipse contour, tolerating a deviation of two pixels around the ideal
ellipse. Results are presented in Fig. 6.b-e.
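
The counting of misplaced points described above can be sketched as follows.
The ellipse parameters (centre cx, cy and semi-axes rx, ry), the function name
and the way the distance is approximated, by sampling the ideal contour densely,
are assumptions made for this illustration; the text only specifies the
tolerance of two pixels.

```python
import math

def count_misplaced(points, cx, cy, rx, ry, tol=2.0, samples=3600):
    """Count extracted points lying farther than tol pixels from the ellipse."""
    # Dense sampling of the ideal contour; the spacing is far below one pixel
    # for ellipses of a few tens of pixels.
    contour = [(cx + rx * math.cos(2.0 * math.pi * k / samples),
                cy + ry * math.sin(2.0 * math.pi * k / samples))
               for k in range(samples)]

    def distance(p):
        # Approximate distance from p to the ellipse: nearest contour sample.
        return min(math.hypot(p[0] - x, p[1] - y) for x, y in contour)

    return sum(1 for p in points if distance(p) > tol)
```

Repeating this count for each value of the scale factor and each version of the
algorithm gives one curve per method, as in the comparison reported in Fig. 6.e.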

All versions of the algorithm require $\sigma$ to be higher than a minimum
value ($\sigma \geq 7$ in the present case) for extracting the contour of the
ellipse. With a smaller value of $\sigma$, points are not grouped together. On the
other hand, $\sigma$ needs to be small for avoiding artifacts, i.e. the number of
misplaced points increases strongly for tensor voting without curvature
information for $\sigma > 10$, and for $\sigma > 34$ if the curvature improvement
is considered. Classical tensor voting adequately extracts the contours, although
with artifacts, for $\sigma$ between 7 and 10. Iterations have little influence on
the results. In comparison, the curvature improvement adequately extracts the
ellipse over a large range of $\sigma$ values, i.e. between 7 and 34. Moreover, it
does not produce any artifacts for $\sigma$ between 7 and 21 and yields smoother
slopes.



4.3 Hand-Written Text Example

Fig. 7 shows another example of contour extraction with the three versions
of the algorithm: non-iterative, iterative with 10 iterations and iterative with
the curvature improvement (also with 10 iterations). The first image “Cyan”
(Fig. 7.a) is composed of sparse points along handwritten characters. The second
one (Fig. 7.b) is the same image “Cyan” with 20% of noise (i.e. every fifth data
point is noise). The same parameters are used for each method. After tensor voting
is applied the contours of the letters are extracted. Results show that tensor
voting with 10 iterations (Fig. 7.e) reduces the artifacts and closes the curves
better than non-iterative tensor voting (Fig. 7.c). With the curvature improvement
(Fig. 7.g) the extracted contours have even fewer artifacts and are much
smoother. Comparison of the results with the noisy image (Fig. 7.d,f,h) shows
that the curvature improvement does not impair the quality but even improves it,
e.g. contour continuity is better preserved.

Fig. 6. Comparison between the three methods. a. The input image is composed of
a sparse set of dots dispersed along the edges of an ellipse. In insets a., b. and
c. all parameters are the same and $\sigma = 8$. b. and c. Extraction results
with, respectively, the non-iterative algorithm and 10 iterations of tensor
voting. The ellipse is adequately extracted but artifacts can be observed, and
slopes are not smooth. Both methods provide similar results. d. With the curvature
improvement and 10 iterations, the ellipse is extracted without artifacts and with
smooth curves. e. Results for $\sigma$ varying between 7 and 27 are presented. The
number of points erroneously extracted, that is, extracted outside the ellipse, is
plotted for each method (horizontal axis: sigma, the scale factor). Tensor voting
without curvature information extracts the ellipse, although always with
artifacts, for $\sigma$ between 7 and 10. Curvature improvement extracts it without
artifacts and tolerates a larger range of $\sigma$ (from 7 to 21).

For regions with straight segments and junctions, both curvature improvement
and iterative tensor voting behave similarly. Therefore, curvature improvement
does not impair the results for such situations. As a consequence the curvature

Fig. 7. A handwritten example. a. The test image “Cyan” is a 128x304 pixel image
composed of points dispersed along handwritten letters. For better visualization
points are magnified. b. The second test image is the same image “Cyan” with 20%
noise. Parameters are the same for all experiments ($\sigma = 15$). c and d.
Extraction results of, respectively, the image “Cyan” and its noisy version with
non-iterative tensor voting. In both cases the algorithm fails to close the curves
and yields a high level of artifacts. e and f. Extraction results of the “Cyan”
images with 10 iterations. Curves are better closed and the level of artifacts is
lower than with non-iterative tensor voting. g and h. Extraction results with the
curvature improvement and 10 iterations. The text is accurately extracted, with
fewer artifacts and smoother slopes. Results resist noise slightly better than
without curvature improvement. It is remarkable that the curve continuity of the
letters C, Y and N is preserved.








Are Iterations and Curvature Useful for Tensor Voting? 






improvement can be used for any kind of image. Remarkably, the curvature
improvement accurately extracts the structure of the example of Fig. 2.a using
the same parameters ($\sigma = 15$ and 20 iterations).

5 Conclusion 

This paper demonstrated that iterations are useful for tensor voting, particularly 
for extracting correct contours in difficult situations like feature competition or 
scale parameter misadjustment. In almost all cases iterations do not impair the 
quality of the results and on the contrary they allow refining and improving the 
final structures. The curvature improvement provides better results for curved
features as it reduces the level of artifacts and smooths curves; it also
increases the robustness of the method to scale parameter misadjustment and noise.

Abstract. Planar motion models can provide gross motion estimation
and good segmentation for image pairs with large inter-frame disparity.
However, as the disparity becomes larger, the resulting dense correspondences
will become increasingly inaccurate for everything but purely planar objects.
Flexible motion models, on the other hand, tend to overfit and thus make
partitioning difficult. For this reason, to achieve dense optical flow for image
sequences with large inter-frame disparity, we propose a two-stage process in
which a planar model is used to get an approximation for the segmentation and
the gross motion, and then a spline is used to refine the fit. We present
experimental results for dense optical flow estimation on image pairs with
large inter-frame disparity that are beyond the scope of existing approaches.



1 Introduction 

Layer-based motion segmentation based on differential optical flow [18,19] can
provide good estimation of both the coherent groups in an image sequence and
the associated motion of each group. However, that work is only applicable to
scenes where the inter-frame disparity is small. There are two major problems 
that arise as the disparity increases. The first is that if the disparity exceeds 
roughly 10-15% of the image size, then even coarse-to-fine optical flow will not 
be able to find the solution [7]. The second is that with large disparity the 
planar motion model associated with the layers (e.g. rigid, affine) likely becomes 
inaccurate for everything but purely planar objects. 

In this paper our goal is to determine dense optical flow - by optical flow we 
are referring to dense correspondences and not the method of differential optical 
flow estimation - between image pairs with large inter-frame disparity. We 
propose a two-stage framework based on a planar motion model for capturing 
the gross motion followed by a regularized spline model for capturing finer 
scale variations. Our approach is related in spirit to the deformotion concept 
in [12], developed for the case of differential motion, which separates overall 
motion (a finite dimensional group action) from the more general deformation 
(a diffeomorphism).

Fig. 1. Non-planarity vs. non-rigidity: The left image pair shows a non-planar object
undergoing 3D rigid motion; the right pair shows an approximately planar object
undergoing non-rigid motion. Both examples result in residual with respect to a 2D
planar fit.

The types of image pairs that we wish to consider are illustrated in Figure 1. 
These have a significant component of planar motion but exhibit residual with 
respect to a planar fit because of either the non-planarity of the object (e.g. a 
cube) or the non-rigidity of the motion (e.g. a lizard). These are scenes for which 
the motion can be approximately described by a planar layer-based framework, 
i.e. scenes that have “shallow structure” [10]. 

It is important to remember that optical flow does not model the 3D motion 
of objects, but rather the changes in the image that result from this motion. 
Without the assumption of a rigid object, it is very difficult to estimate the 3D 
structure and motion of an object from observed change in the image, though 
there is existing work that attempts to do this [5,17]. For this reason, we choose 
to do all estimation in the image plane (i.e. we use 2D models), but we show 
that if the object is assumed to be rigid, the correspondences estimated can be 
used to recover the dense structure and 3D motion. 

This approach extends the capabilities of feature-based scene matching al- 
gorithms to include dense optical flow without the limits on allowable motion 
associated with techniques based on differential optical flow. Previously, feature- 
based approaches could handle image pairs with large disparity and multiple in- 
dependently moving objects, while optical flow techniques could provide a dense 
set of pixel correspondences even for objects with non-rigid motion. However, 
neither type of approach could handle both simultaneously. Without the as- 
sumption of a rigid scene, existing feature-based methods cannot produce dense 
optical flow from the sparse correspondences, and in the presence of large dis- 
parity and multiple independently moving objects, differential optical flow (even 
coarse-to-fine) can break down. The strength of our approach is that dense op- 
tical flow can now be estimated for image pairs with large disparity, more than 
one independently moving object, and non-planar (including non-rigid) motion. 

The structure of the paper is as follows. We will begin in Section 2 with an 
overview of related work. In Section 3, we detail the components of our approach. 
We discuss experimental results in Section 4. There is discussion in Section 5 
and the paper concludes with Section 6. 

2 Related Work 

The work related to our approach comes from the areas of motion segmentation,
optical flow and feature-based (sparse) matching. Several well-known approaches
to motion segmentation are based on dense optical flow estimation [1,3,8];
in these approaches the optical flow field was assumed to be piecewise smooth
to account for discontinuities due to occlusion and object boundaries. Wang & 
Adelson introduced the idea of decomposing the image sequence into multiple 
overlapping layers, where each layer represents an affine motion field [18]. How- 
ever their work was based on differential optical flow, which places strict limits 
on the amount of motion between two frames. 

In [19], Weiss uses regularized radial basis functions (RBFs) to estimate dense 
optical flow; Weiss’ method is based on the assumption that while the motion 
will not be smooth across the entire image, the motion is smooth within each 
of the layers. Given the set of spatiotemporal derivatives, he used the EM algo- 
rithm to estimate the number of motions, the dense segmentation and the dense 
optical flow. This work, along with other spline-based optical flow methods [13,14],
also assumes differential motion and therefore does not apply to
the types of sequences that we are considering.

In [16], Torr et al. show that the trifocal tensor can be used to cluster groups 
of sparse correspondences that move coherently. This work addresses similar 
types of sequences to those of our work in that it is trying to capture more 
than simply a planar approximation of motion, but it does not provide dense 
assignment to motion layers or dense optical flow. The paper states that it is
an initialization and that more work is needed to provide a dense segmentation.
However, the extension of dense stereo assignment to multiple independent
motions is certainly non-trivial and there is yet to be a published solution. In
addition, this approach is not applicable for objects with non-rigid motion, as 
the fundamental matrix and trifocal tensor apply only to rigid motion. 

Our work builds on the motion segmentation found via planar motion models 
as in [20] , where planar transformations are robustly estimated from point corre- 
spondences in a RANSAC framework. A dense assignment of pixels to transfor- 
mation layers is then estimated using an MRF. We refine the planar estimation 
produced by [20] using a regularized spline fit. Szeliski & Shum [14] also use a 
spline basis for motion estimation, however their approach has the same limita- 
tions on the allowable motion as other coarse-to-fine methods. 

3 Our Approach 

Our goal in this paper is to determine the dense optical flow for pairs of images 
with large inter-frame disparity and in the presence of multiple independent 
motions. If the scene contains objects undergoing significant 3D motion or de- 
formation, the optical flow cannot be described by any single low dimensional 
image plane transformation (e.g. an affine transformation or a homography). 
However, to keep the problem tractable we need a compact representation of 
these transformations; we propose the use of thin plate splines for this purpose. 
A single spline is not sufficient for representing multiple independent motions, 
especially when the motion vectors intersect [19]. Therefore we represent the 
optical flow between two frames as a set of disjoint splines. By disjoint we mean
that the supports of the splines are disjoint subsets of the image plane. The
task of fitting a mixture of splines naturally decomposes into two subtasks: motion
segmentation and spline fitting. Ideally we would like to do both of these
tasks simultaneously; however, these tasks have conflicting goals. The task of
motion segmentation requires us to identify groups of pixels whose motion can
be described by a smooth transformation. Smoothness implies that each motion
segment has the same gross motion; however, except for the rare case
in which the entire layer has exactly the same motion everywhere, there will be
local variations. Hence the motion segmentation algorithm should be sensitive to
inter-layer motion and insensitive to intra-layer variations. On the other hand,
fitting a spline to each motion field requires attention to all the local variations.
This is an example of different tradeoffs between bias and variance in the two
stages of the algorithm. In the first stage we would like to exert a high bias and
use models with a high amount of stiffness and insensitivity to local variations,
whereas in the second stage we would like to use a more flexible model with a
low bias.

Fig. 2. Determining Long Range Optical Flow. The goal is to provide dense optical
flow from the first frame (1), to the second (4). This is done via a planar fit (2) followed
by a flexible fit (3).

The motion segmentation consists of a two-stage RANSAC-based robust
estimation procedure which operates on a sparse set of correspondences between 
the two frames. Any planar transformation can be used as the motion model in 
this stage; we use homographies in this paper. Once the dominant motions have 
been detected, a dense assignment is performed using a fast graph partitioning 
algorithm. 
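The robust planar estimation step can be illustrated with a minimal sketch, assuming NumPy and a single dominant motion. It is not the two-stage procedure of [20]: it is a plain RANSAC loop over four-point homography hypotheses, and the function names, the pixel threshold and the trial count are illustrative assumptions.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform; src and dst are (n, 2) arrays with n >= 4."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the (approximate) null vector of the constraint matrix.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 3)

def apply_homography(H, pts):
    """Map (n, 2) points through H and dehomogenize."""
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

def ransac_homography(src, dst, threshold=3.0, n_trials=500, seed=0):
    """Return a homography refit on the largest consensus set, and that set."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_trials):
        sample = rng.choice(len(src), size=4, replace=False)
        H = fit_homography(src[sample], dst[sample])
        # Degenerate samples may divide by zero; NaN errors never count as inliers.
        with np.errstate(divide="ignore", invalid="ignore"):
            err = np.linalg.norm(apply_homography(H, src) - dst, axis=1)
        inliers = err < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit_homography(src[best_inliers], dst[best_inliers]), best_inliers
```

To detect several dominant motions, one simple option (again an assumption, not the method of [20]) is to remove the consensus set and rerun the loop on the remaining correspondences.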

The output of the first stage, while sufficient to achieve a good segmentation,
is not sufficient to recover the optical flow accurately. However, it serves two
important purposes: firstly it provides an approximate segmentation of the sparse 
correspondences that allows for coherent groups to be processed separately. This 
is crucial for the second stage of the algorithm as a flexible model will likely find 
an unwieldy compromise between distinct moving groups as well as outliers. 
Secondly, since the assignment is dense, it is possible to find matches for points 
that were initially mismatched by limiting the correspondence search space to 
points in the same motion layer. The second stage then bootstraps off of these 
estimates of motion and layer support to iteratively fit a thin plate spline to 
account for non-planarity or non-rigidity in the motion. Figure 2 illustrates this 
process. 

We now describe the two stages of the algorithm in detail. 
3.1 Detecting Dominant Planar Motion

We begin by finding planar approximations of the motion in the scene as well as 
a dense assignment of pixels to motion layers. We use the motion segmentation 
algorithm of [20]. An example of this is shown in Figure 3.

Example of Planar Fit and Segmentation. Figure 3 shows an example of the
output from the planar fit and segmentation process. In this figure we show the
two images, I and I', and the assignments for each pixel to a motion layer (one
of the three detected motion fields). The columns represent the different motion
fields and the rows represent the portions of each image that are assigned to a
given motion layer. The motions are made explicit in that the pixel support from
frame to frame is related exactly by a planar homography. Notice that the portions
of the background and the dumpsters that were visible in both frames were
segmented correctly, as was the man. The result of the spline fit for this
example will be shown in Section 4.

Fig. 3. Notting Hill sequence. (Row 1) Original image pair of size 311 x 552,
(Row 2) Pixels assigned to warp layers 1-3 in I, (Row 3) Pixels assigned to warp
layers 1-3 in I'.



3.2 Refining the Fit with a Flexible Model 

The flexible fit is an iterative process using regularized radial basis functions, in 
this case the Thin Plate Spline (TPS). The spline interpolates the correspondences
to result in a dense optical flow field. This process is run on a per-motion layer 
basis. 



Feature Extraction and Matching. During the planar motion estimation 
stage, only a gross estimate of the motion is required so a sparse set of feature 
points will suffice. In the final fit however, we would like to use as many corre- 
spondences as possible to ensure a good fit. In addition, since the correspondence 
search space is reduced (i.e. matches are only considered between pixels assigned 
to corresponding motion layers), matching becomes somewhat simpler. For this 
reason, we use the Canny edge detector to find the set of edge points in each of 
the frames and estimate correspondences in the same manner as in [20]. 

Iterated TPS Fitting. Given the approximate planar homography and the 
set of correspondences between edge pixels, we would like to find the dense set 
of correspondences. If all of the correspondences were correct, we could jump 
straight to a smoothed spline fit to obtain dense (interpolated) correspondences
for the whole region. However, we must account for the fact that many of the 
correspondences are incorrect. As such, the purpose of the iterative matching is 
essentially to distinguish inliers from outliers, that is, we would like to identify 
sets of points that exhibit coherence in their correspondences. 

One of the assumptions that we make about the scenes we wish to consider 
is that the motion of the scene can be approximated by a set of planar layers. 
Therefore a good initial set of inliers are those correspondences that are roughly 
approximated by the estimated homography. From this set, we use TPS regres- 
sion with increasingly tighter inlier thresholds to identify the final set of inliers, 
for which a final fit is used to interpolate the dense optical flow. We now briefly 
describe this process. 

The Thin Plate Spline is the Radial Basis Function (RBF) that minimizes
the following bending energy or integral bending norm [4],

$$I_f = \iint_{\mathbb{R}^2} \left( f_{xx}^2 + 2 f_{xy}^2 + f_{yy}^2 \right) dx\,dy$$

where $f = f(x, y)$ represents the x or y component of the transformation for
the pixel at position $(x, y)$. In our approach we use a regularized version of TPS
fitting in which $\mu$ controls the tradeoff between data and smoothing in the cost
functional

$$H[f] = \sum_i \left( v_i - f(x_i, y_i) \right)^2 + \mu I_f$$

where $v_i$ represents the target of the transformation and $f(x_i, y_i)$ is the mapped
value for the point at location $(x_i, y_i)$. Since each point gets mapped to a new
2D position, we require two TPS transformations, one for the x-coordinates and
another for the y-coordinates. We solve for this transformation as in [9].
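For reference, the regularized problem has a well-known closed-form solution in the standard TPS formulation. The symbols $w$, $a$, $K$ and $P$ below are notation introduced here for illustration, $\mathbf{I}_n$ is the $n \times n$ identity, and $\mu$ is taken to absorb any constant factor that depends on how the kernel is normalized.

```latex
% Minimizer of H[f] for n correspondences (x_i, y_i) -> v_i
f(x, y) = a_0 + a_x x + a_y y
          + \sum_{i=1}^{n} w_i \, U\!\left( \lVert (x_i, y_i) - (x, y) \rVert \right),
\qquad U(r) = r^2 \log r

% Coefficients from one linear solve, with K_{ij} = U(\lVert (x_i,y_i) - (x_j,y_j) \rVert)
% and the i-th row of P equal to (1, x_i, y_i)
\begin{pmatrix} K + \mu \mathbf{I}_n & P \\ P^{\top} & 0 \end{pmatrix}
\begin{pmatrix} w \\ a \end{pmatrix}
=
\begin{pmatrix} v \\ 0 \end{pmatrix}
```

With $\mu = 0$ the spline interpolates the targets exactly; as $\mu$ grows the non-affine weights $w$ shrink and the fit approaches a least-squares affine map, which is the "stiff" behavior wanted while mismatches are still present.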

We estimate the TPS mapping from the points in the first frame to those in
the second, where $\mu_t$ is the regularization factor for iteration t. The fit is
estimated using the set of correspondences that are deemed inliers for the
current transformation, where $\tau_t$ is the threshold for the t-th iteration.
After the transformation is estimated, it is applied to the entire edge set and
the set of correspondences is again processed for inliers, using the new locations
of the points for error computation. This means that some correspondences that
were outliers before may be pulled into the set of inliers and vice versa. The
iteration continues on this new set of inliers where $\tau_{t+1} < \tau_t$ and
$\mu_{t+1} < \mu_t$. We have found that three iterations of this TPS regression
with incrementally decreasing regularization and corresponding outlier thresholds
suffice for a large set of real world examples. Additional iterations produced
no change in the estimated set of inlier correspondences.



I. Estimate planar motion
   1. Find correspondences between I and I'
   2. Robustly estimate the motion fields
   3. Densely assign pixels to motion layers
II. Refine the fit with a flexible model
   4. Match edge pixels between I and I'
   5. For t = 1:3
   6.    Fit all correspondences within $\tau_t$ using TPS regularized by $\mu_t$
   7.    Apply TPS to set of correspondences
   Note: ($\tau_{t+1} < \tau_t$, $\mu_{t+1} < \mu_t$)

Fig. 4. Algorithm Summary
This simultaneous tightening of the pruning threshold and annealing of the
regularization factor aids in differentiating between residual due to localization
error or mismatching and residual due to the non-planarity of the object in 
motion. When the pruning threshold is loose, it is likely that there will be some 
incorrect correspondences that will pass the threshold. This means that the 
spline should be stiff enough to avoid the adverse effect of these mismatches. 
However, as the mapping converges we place higher confidence in the set of 
correspondences passing the tighter thresholds. This process is similar in spirit 
to iterative deformable shape matching methods [2,6]. 
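The annealed fitting loop described above can be sketched as follows, assuming NumPy and the standard TPS linear system with kernel $r^2 \log r$. The function names and the `planar_pred` argument (the positions predicted by the planar fit) are hypothetical; the default thresholds and regularization values are the ones listed in the parameter footnote of Section 4, and their scale relative to this particular kernel normalization is an assumption.

```python
import numpy as np

def _kernel(a, b):
    # TPS radial basis U(r) = r^2 log r between point sets a (n, 2) and b (m, 2).
    r2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return 0.5 * r2 * np.log(np.where(r2 > 0, r2, 1.0))  # U(0) = 0

def fit_tps(src, dst, mu):
    """Regularized TPS from src (n, 2) to dst (n, 2); both coordinates at once."""
    n = len(src)
    P = np.hstack([np.ones((n, 1)), src])
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = _kernel(src, src) + mu * np.eye(n)
    A[:n, n:] = P
    A[n:, :n] = P.T
    coef = np.linalg.solve(A, np.vstack([dst, np.zeros((3, 2))]))
    return src, coef[:n], coef[n:]  # control points, non-affine weights, affine part

def apply_tps(model, pts):
    ctrl, w, a = model
    return _kernel(pts, ctrl) @ w + np.hstack([np.ones((len(pts), 1)), pts]) @ a

def iterated_tps(src, dst, planar_pred,
                 taus=(15.0, 10.0, 5.0), mus=(50.0, 20.0, 1.0)):
    """Alternate inlier pruning and TPS fitting with tightening tau and mu."""
    pred = planar_pred
    for tau, mu in zip(taus, mus):
        # Inliers are judged against the current mapped locations of all points.
        inliers = np.linalg.norm(pred - dst, axis=1) < tau
        model = fit_tps(src[inliers], dst[inliers], mu)
        pred = apply_tps(model, src)
    return model, inliers
```

Because the spline is re-applied to the whole correspondence set after every fit, a correspondence rejected in one iteration can re-enter in the next, as the text describes.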

4 Experimental Results 

We now illustrate our algorithm, which is summarized in Figure 4, on several 
pairs of images containing objects undergoing significant, non-planar motion. 
Since the motion is large, dis- 
playing the optical flow as a vec- 
tor field will result in a very 
confusing figure. Because of this, 
we show the quality of the op- 
tical flow in other ways, includ- 
ing (1) examining the image and 
corresponding reconstruction er- 
ror that result from the applica- 
tion of the estimated transform to 
the original image (we refer to this 
transformed image as T(I)), (2)
showing intermediate views (as in 
[11]), or by (3) showing the 3D 
reconstruction induced by the set 
of dense correspondences. Exam- 
ples are presented that exhibit 
either non-planarity, non-rigidity 
or a combination of the two. We 
show that our algorithm is capa- 
ble of providing optical flow for 
pairs of images that are beyond 
the scope of existing algorithms. We performed all of the experiments on 
grayscale images using the same parameters (footnote 1).

4.1 Face Sequence 

The first example is shown in Figures 5 and 6. The top row of Figure 5 shows
the two input frames, I and I', in which a man moves his head to the left in
front of a static scene (the nose moves more than 10% of the image width). The
second row shows first the difference image between T(I) and I' where error
values are on the interval [-1,1] and gray regions indicate areas of zero error.
This image is followed by T(I); this image has two estimated transformations,
one for the face and another for the background. Notice that error in the overlap
of the faces is very small, which means that according to reconstruction error, the
estimated transformation successfully fits the relation between the two frames.
This transformation is non-trivial as seen in the change in the nose and lips
as well as a shift in gaze seen in the eyes; however, all of this is captured by
the estimated optical flow. The final row in Figure 5 shows the segmentation
and planar approximation from [20], where the planar transformation is made
explicit as the regions’ pixel supports are related exactly by a planar homography.

Fig. 5. Face Sequence. (1) The two input images, I and I', of size 240x320.
(2) The difference image is shown first, where grey regions indicate zero error
regions, and the reconstruction, T(I), is second. (3) The initial segmentation
found via planar motion.

Footnote 1: $k = 2$, $\lambda = .285$, $\tau_p = 15$, $\mu_1 = 50$, $\mu_2 = 20$, $\mu_3 = 1$,
$\tau_1 = 15$, $\tau_2 = 10$, $\tau_3 = 5$. Here, $k$, $\lambda$, and $\tau_p$ refer to parameters in [20].



Dense correspondences allow for the 
estimation of intermediate views via 
interpolation as in [11]. Figure 6 
shows the two original views of the 
segment associated with the face as 
well as a synthesized intermediate 
view that is realistic in appearance. 
The second row of this figure shows 
an estimation of relative depth that 
comes from the disparity along the 
rectified horizontal axis. Notice the 
shape of the nose and lips as well as 
the relation of the eyes to the nose and 
forehead. It is important to remember 
that no information specific to human 
faces was provided to the algorithm 
for this optical flow estimation. 

4.2 Notting Hill Sequence 

The next example shows how the 
spline can also refine what is already a 
close approximation via planar mod- 
els. Figure 7 shows a close up of the 
planar error image, the reconstruc- 
tion error and finally the warped grid 
for the scene that was shown in Fig- 
ure 3. The planar approximation was 
not able to capture the 3D nature of 
the clothing and the non-rigid motion 
of the head with respect to the torso, 
however the spline fit captures these 
things accurately. 
Fig. 6. Face Sequence - Interpolated views. (1) Original frame I', synthesized
intermediate frame, original frame I, (2) A surface approximation from computed
dense correspondences.




Fig. 7. Notting Hill. Detail of the spline fit 
for a layer from Figure 3, difference image 
for the planar fit, difference image for the 
spline fit, grid transformation. 
Fig. 8. Gecko Sequence. (1) Original frame I of size 102 x 236, synthesized intermediate
view, original frame I', (2) T(I), Difference image between the above image and I' (gray
is zero error), Difference image for the planar fit.



4.3 Gecko Sequence 

The next example, shown in Figure 8, displays a non-planar
object (a gecko lizard) undergoing non-rigid motion. While this is a single object
sequence, it shows the flexibility of our method to handle complicated motions. 
In Figure 8(1), the two original frames are shown as well as a synthesized in- 
termediate view (here, intermediate refers to time rather than viewing direction 
since we are dealing with non-rigid motion). The synthesized image is a reasonable
guess at what the scene would look like midway between the two input
frames. Figure 8(2) shows T(I) as well as the reconstruction error for the spline 
fit (T(I) - I'), and the error incurred with the planar fit. We see in the second
row of Figure 8(2) that the tail, back and head of the gecko are aligned very 
well and those areas have negligible error. When we compare the reconstruction 
error to the error induced by a planar fit, we see that the motion of the gecko 
is not well approximated by a rigid plane. Here, there is also some 3D motion 
present in that the head of the lizard changes in both direction and elevation. 
This is captured by the estimated optical flow. 



4.4 Rubik’s Cube 

The next example shows a scene with 
rigid motion of a non-planar object. 
Figure 9 displays a Rubik’s cube and 
user’s manual switching places as the 
cube rotates in 3D. Below the frames, 
we see the segmentation that is a re- 
sult of the planar approximation. It 
is important to remember that the 
background in this scene has no dis- 
tinguishing marks so there is nothing 
to say that pieces of the background 
didn’t actually move with the objects. 




Fig. 9. Rubik’s Cube. (1) Original image 
pair of size 300 x 400, (2) assignments of 
each image to layers 1 and 2. 
Fig. 10. Rubik’s Cube - Detail. (1) Original frame I, synthesized intermediate frame,
original frame I', a synthesized novel view, (2) difference image for the planar fit,
difference image for the spline fit, T(I), the estimated structure shown for the edge
points of I. We used dense 3D structure to produce the novel view.

Figure 10 shows T(I), the result of the spline fit applied to this same scene. The
first row shows a detail of the two original views of the Rubik’s cube as well 
as a synthesized intermediate view. Notice that the rotation in 3D is accurately 
captured and demonstrated in this intermediate view. The second row shows the 
reconstruction errors, first for the planar fit and then for the spline fit, followed 
by T(I). Notice how accurate the correspondence is since the spline applied to 
the first image is almost identical to the second frame. 

Correspondences between portions of two frames that are assumed to be 
projections of rigid objects in motion allow for the recovery of the structure of 
the object, at least up to a projective transformation. In [15], the authors show a 
sparse point-set from a novel viewpoint and compare it to a real image from the 
same viewpoint to show the accuracy of the structure. Figure 10 shows a similar 
result, however since our correspondences are dense, we can actually render 
the novel view that validates our structure estimation. The novel viewpoint is 
well above the observed viewpoints, yet the rendering as well as the displayed 
structure is fairly accurate. Note that only the set of points that were identified 
as edges in I are shown; this is not the result of simple edge detection on the 
rendered view. We use this display convention because the entire point-set is 
too dense to allow the perception of structure from a printed image. However, 
the rendered image shows that our estimated structure was very dense. It is 
important to note that the only assumption that we made about the object is 
that it is a rigid, piecewise smooth object. To achieve similar results from sparse 
correspondences would require additional object knowledge, namely that the 
object in question is a cube and has planar faces. It is also important to point 
out that this is not a standard stereo pair since the scene contains multiple 
objects undergoing independent motion. 

5 Discussion 

Since splines form a family of universal approximators over $\mathbb{R}^2$ and can represent
any 2D transformation to any desired degree of accuracy, the question arises
as to why one needs to use two different motion models in the two stages of the
algorithm. If one were to use the affine transform as the dominant motion model, 
splines with an infinite or very large degree of regularization can indeed be used in 
its place. However, in the case where the dominant planar motion is not captured 
by an affine transform and we need to use a homography, it is not practical to use 
a spline. This is so because the set of homographies over any connected region
of $\mathbb{R}^2$ is unbounded, and can in principle require a spline with an unbounded
number of knots to represent an arbitrary homography. So while a homography
can be estimated using a set of four correspondences, the corresponding
spline approximation can, in principle, require an arbitrarily large number of 
control points. This poses a serious problem for robust estimation procedures 
like RANSAC since the probability of hitting the correct model decreases expo- 
nentially with increasing degrees of freedom. Many previous approaches for cap- 
turing long range motion are based on the fundamental matrix. However, since 
the fundamental matrix maps points to lines, translations in a single direction 
with varying velocity and sign are completely indistinguishable, as pointed out, 
e.g. by [16]. This type of motion is observed frequently in motion sequences. The 
trifocal tensor does not have this problem; however, like the fundamental matrix, 
it is only applicable for scenes with rigid motion and there is not yet a published 
solution for dense stereo correspondence in the presence of multiple motions. 
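The exponential dependence of RANSAC on model size can be made concrete with the standard trial-count estimate: if a fraction w of the correspondences are inliers and a hypothesis needs s of them, a random sample is all-inlier with probability w^s. The sketch below is illustrative only; the function name, the 50% inlier ratio and the 99% confidence level are assumptions, not values from the experiments.

```python
import math

def ransac_trials(inlier_ratio, sample_size, confidence=0.99):
    """Trials needed so that, with the given confidence, at least one
    random sample consists only of inliers."""
    p_good = inlier_ratio ** sample_size  # chance that one sample is all-inlier
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))

# A homography needs 4 correspondences; a model needing more control
# points requires far more trials at the same inlier ratio.
for s in (4, 8, 16):
    print(s, ransac_trials(0.5, s))
```

At a 50% inlier ratio, a four-point model needs 72 trials under this estimate, and each additional pair of required points roughly quadruples that number, which is why a many-knot spline is impractical as the model inside the robust estimation loop.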

6 Conclusion 

In this paper, we have presented a new method for determining long range 
optical flow. We have shown that dense optical flow can now be estimated for 
image pairs with large disparity, multiple independently moving objects, and 
non-planar (including non-rigid) motion. Our approach is a two-stage framework 
based on a planar motion model for capturing the gross motion of the group 
followed by regularized spline fitting for capturing finer scale variations. 

Our approach is intentionally generic in that it requires no object knowledge. 
However, in many cases, information about the types of objects in question could 
be used. The partitioning and initial estimation of gross motion may benefit from 
the use of articulated/object models. While a general solution using known mod- 
els would require a solution to object recognition, incorporating object knowledge 
and models in specific domains will be the subject of future research. 
Abstract. In this paper we propose an efficient real-time approach that
combines vision-based tracking and a view-based model to estimate the 
pose of a person. We introduce an appearance model that contains views 
of a person under various articulated poses. The appearance model is 
built and updated online. The main contribution consists of modeling, 
in each frame, the pose changes as a linear transformation of the view 
change. This linear model allows (i) for predicting the pose in a new 
image, and (ii) for obtaining a better estimate of the pose corresponding 
to a key frame. Articulated pose is computed by merging the estimation 
provided by the tracking-based algorithm and the linear prediction given 
by the view-based model. 



1 Introduction 

Speed and robustness are usually the two important features of a vision-based 
face or person tracking algorithm. Though real-time tracking techniques have 
been developed and work well in laboratories (compliant users, stable and
adapted lighting), they tend to break easily when used in real conditions (users
performing fast moves, being occluded or only partially in the field of view of the
camera). Tracking algorithm failures usually require a re-initialization, which
therefore prevents their use in many applications.

In this paper we address the problem of robustness in tracking algorithms. 
We propose an efficient online real-time approach that combines vision-based 
tracking and a view-based model to estimate the pose of an articulated object. 
We introduce an appearance model that contains views (or key frames) of a 
person under various articulated poses. The appearance model is built and up- 
dated online. The main contribution consists of modeling, in each frame, the pose 
change as a linear transformation of the view change (optical flow). This linear 
model allows (i) for predicting the pose in a new image, and (ii) for obtaining a 
better estimate of the pose that corresponds to a key frame. Articulated pose is 
computed by merging the estimation provided by the tracking-based algorithm 
and the linear prediction given by the view-based model. 

The following section discusses previous work for tracking and view-based 
models. Section 3 introduces our view-based model and shows how such a model 
is used to predict the articulated pose in a new image. Section 4 describes our
standard recursive tracking algorithm. We then present the general framework 
that combines recursive tracking and view-based model in Section 5. Finally we 
report experiments with our approach in Section 6 and discuss the general use 
of our approach in Section 7. 

2 Previous Work 

Vision-based tracking of articulated objects has been an active and growing 
research area in the last decade due to its numerous potential applications. Ap- 
proaches to track articulated models in monocular image sequences have been 
proposed. Dense optical flow has been used in differential approaches where the 
gradient in the image is linearly related to the model movement [2,17]. Since 
monocular motion-based approaches only estimate relative motion from frame 
to frame, small errors are accumulated over time and cause the pose estimation 
to be sensitive to drift. 

Recently, systems for 3-D tracking of hand and face features using stereo have
been developed [8,4,9,5,12]. Such approaches usually minimize a fitting function
error between a geometric model (limbs modeled as quadrics, cylinders, soft objects,
...) and visual observations (tridimensional scene reconstructions, colors).
The minimization is usually performed locally (initialized with the pose
estimated at the previous frame) and is therefore subject to local minima, causing
the tracking to easily fail when, for instance, motions between frames are large.
To prevent this pitfall caused by local minima, many researchers have
investigated stochastic optimization techniques such as particle filtering [13,14].
Though promising, these approaches are very time-consuming and cannot yet
be implemented for real-time purposes.

In this paper, we propose to tackle the problem of local minima in the min- 
imization of the fitting function error by recovering tracking failures using a 
view-based model. View-based models have been mainly developed for repre- 
senting the appearance of a rigid object from different points of view [10]. These 
appearance models are usually trained on images labeled with sets of landmarks, 
used for image point matching between frames, and annotated with the corre- 
sponding rigid pose. These models are able to capture the shape and appearance 
variations between people. The main drawback, however, is that the training 
phase is painstakingly long (requiring manual point matching between hundreds 
of images) and the pose estimate is very approximate. [3] recently proposed an 
approach for increasing the pose estimation accuracy in view-based models by 
using a linear subspace for shape and texture. 

Recent work has suggested the combination of traditional tracking algorithms 
with view-based models. [16] proposes a simple approach that uses a set of 
pose-annotated views to re-initialize a standard recursive tracking algorithm. 
However the approach assumes that the annotation is manual and offline. A 
similar approach is proposed in [11] where an adaptive view-based model is 
used to reduce the drift of a differential face tracking algorithm. The authors 
introduce an interesting linear Gaussian filter that simultaneously estimates the
correct pose of a user face and updates the view-based model. 



3 View-Based Model 

In this paper, we assume the body model to be articulated. The pose $\Pi$ of a
body is defined as the position of the torso and the relative orientation between
consecutive limbs. We introduce here a view-based model M that represents the
relationship between visual information and articulated pose $\Pi$.

Our view-based model M consists of a collection of key frames T. Each
key frame contains the visual information (view), the pose
associated with the view and a linear transformation that relates the pose change
with respect to the view change. Different approaches have been proposed to
model image deformation (morphable models, active appearance models, ...).
In this paper, we model image deformations by considering the optical flow
around a set of support feature points $f_i$. A key frame T is defined as:

$$T = \{J, x, L, \Pi_0\}$$

where J is the view (intensity image) associated with the key frame, $x = (f_1, f_2, \ldots)$
is a vector formed by stacking the locations of the feature points $f_i$, and $\Pi_0$ is
the articulated pose associated with the view J. L is a matrix that
represents the local linear transformation between the articulated pose $\Pi$ and
the image flow between a new view I and the view J:

$$\Pi = \Pi_0 + L\,dx \qquad (1)$$

where $dx = x' - x$ is the image motion between the support point locations $x'$
in the image I and the original support points x in image J.

Modeling the linear transformation between articulated pose and image
deformation allows a compact representation of the information contained in
similar views. It therefore makes it possible to span a larger part of the
appearance space with fewer key frames. It also provides a better estimate of
the articulated pose.
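
The key frame definition and the linear prediction of Eq. (1) can be sketched as a small data structure. This is only an illustration: the class name, field names, and array shapes are assumptions made for the example and do not come from the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class KeyFrame:
    """One key frame {J, x, L, Pi_0}; all names here are illustrative."""
    view: np.ndarray      # J: intensity image, shape (H, W)
    support: np.ndarray   # x: stacked support point coordinates, shape (2n,)
    L: np.ndarray         # linear map from image motion to pose change, shape (p, 2n)
    pose0: np.ndarray     # Pi_0: articulated pose associated with the view, shape (p,)

    def predict_pose(self, new_support: np.ndarray) -> np.ndarray:
        """Eq. (1): Pi = Pi_0 + L dx, with dx = x' - x."""
        dx = new_support - self.support
        return self.pose0 + self.L @ dx

# Toy key frame with two support points (4 coordinates) and a 2-parameter pose.
kf = KeyFrame(
    view=np.zeros((4, 4)),
    support=np.array([1.0, 2.0, 3.0, 4.0]),
    L=np.array([[0.5, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 2.0]]),
    pose0=np.array([10.0, 20.0]),
)
pose = kf.predict_pose(np.array([3.0, 2.0, 3.0, 5.0]))
```

Because the prediction is a single matrix-vector product, one key frame can cover a neighbourhood of similar views rather than a single pose.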



3.1 Pose Prediction 

Given a new image $I$, not necessarily present in the view-based model, an
estimate of the corresponding articulated pose $\Pi$ is obtained as follows:

- The key frame $\mathcal{F}_k$ whose image $J_k$ is closest to $I$ with respect to
the image distance $d_I(\cdot,\cdot)$ is selected.

- The image motion $dx^{(k)}$ of the support points between images $J_k$ and $I$
is estimated.

- The pose $\Pi$ is predicted as $\Pi = \Pi_0^{(k)} + L^{(k)} dx^{(k)}$.

Fig. 1. The left image shows the current image. The right image shows the detected
key frame of the view-based model, optical flow of the support points (in blue) and the
prediction of the articulated body pose from the linear model (in white).



In our current implementation, $d_I(I, J_k)$ is defined as the weighted sum of
absolute pixel differences between images $I$ and $J_k$:

$$d_I(I, J_k) = \sum_{i,j} w_{i,j}\,|I(i,j) - J_k(i,j)|$$

where $(i, j)$ are pixel coordinates and $w_{i,j}$ are foreground weights that
indicate whether pixel $(i, j)$ in image $I$ corresponds to foreground
($w_{i,j} = 1$) or background ($w_{i,j} = 0$). In this paper, the weights $w_{i,j}$
are estimated using a foreground detection algorithm similar to [15]. This
algorithm updates a background model online and therefore performs a robust
foreground detection, allowing our approach to be robust to slowly varying
backgrounds.
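
As an illustration of the key frame search, the following sketch computes the weighted sum of absolute differences and picks the closest view. The function names and the tiny 2x2 images are assumptions made for the example; a real implementation would operate on full-resolution images and weights from the foreground detector.

```python
import numpy as np

def image_distance(I, J, w):
    """Weighted sum of absolute pixel differences; w holds foreground weights in [0, 1]."""
    return float(np.sum(w * np.abs(I.astype(float) - J.astype(float))))

def closest_key_frame(I, views, w):
    """Return the index k of the view J_k closest to I, and all distances."""
    distances = [image_distance(I, J, w) for J in views]
    return int(np.argmin(distances)), distances

I = np.array([[10, 10], [200, 50]])
w = np.array([[1, 1], [0, 1]])            # bottom-left pixel is background
views = [np.array([[12, 9], [0, 60]]),    # large difference only in the background pixel
         np.array([[40, 10], [200, 50]])]
k, distances = closest_key_frame(I, views, w)
```

In this toy case the first view wins even though its raw pixel difference is larger, because the mismatch falls on a background pixel with zero weight.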

Figure 1 shows an example of detected key frame and linear prediction from 
the view-based model. The approach we present here consists in building and 
using such a view-based model to improve the robustness and accuracy of a 
tracking-based pose estimation algorithm. 

4 Model-Based Tracking 

This section briefly describes our real-time model-based tracking algorithm,
previously published in [5]. Our approach uses a force-driven technique similar
to [4,9] that allows the enforcement of different kinds of constraints on the
body pose (joint angles, orientation, ...). These constraints can also be learnt
from examples using a Support Vector Machine [6]. For simplicity, only the
force-driven technique is described here.

Fig. 2. Our geometric-based tracking algorithm minimizes the Euclidean distance
between an articulated model (left image) and the 3D reconstruction of the
disparity image (middle image) corresponding to the scene (right image).

We consider the pose estimation problem as the fitting of a body model pose
$\Pi$ to a set of visual observations. When visual observations come from a stereo
or multi-view camera, three-dimensional reconstructions $\mathcal{V} = \{M_i\}$ of
the points $M_i$ in the scene can be estimated. In this case, a fitting error
function $E(\Pi)$ defined as the distance between the reconstructed points
$\mathcal{V}$ and the 3D model at pose $\Pi$ is suitable. Such a function can be
defined as:

$$E(\Pi) = \sum_{M_i \in \mathcal{V}} d^2(M_i, \mathcal{B}(\Pi)) \qquad (2)$$

where $\mathcal{B}(\Pi)$ is the 3D reconstruction of the body model at pose $\Pi$ and
$d^2(M_i, \mathcal{B}(\Pi))$ is the squared Euclidean distance between the point
$M_i$ and the 3D model $\mathcal{B}(\Pi)$.
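
A minimal sketch of the fitting error of Eq. (2) follows. It assumes, purely for illustration, that the body model at a given pose is available as a set of sampled surface points, so that the point-to-model distance reduces to a nearest-sample distance; the paper does not specify how the model surface is represented.

```python
import numpy as np

def fitting_error(scene_points, model_points):
    """Sum over scene points of the squared distance to the nearest model sample."""
    diff = scene_points[:, None, :] - model_points[None, :, :]   # (n_scene, n_model, 3)
    sq_dist = np.sum(diff ** 2, axis=2)
    return float(np.sum(np.min(sq_dist, axis=1)))

model = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])   # sampled model surface (assumed)
scene = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 2.0]])   # reconstructed 3D points
error = fitting_error(scene, model)
```

The brute-force distance matrix is quadratic in the number of points; a spatial index would be the usual replacement for larger point sets.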

A direct approach for pose tracking consists in minimizing the fitting error
$E(\Pi)$ using a recursive scheme: the pose $\Pi_{t-1}$ estimated at the previous
frame is used as initialization in a local optimization algorithm that searches
for directions $\tau$ around $\Pi_{t-1}$ that minimize the fitting error
$E(\Pi_{t-1} + \tau)$.

The iterative tracking algorithm consists of two steps: (i) an ICP step that
estimates a set of unconstrained rigid motions $\delta_k$ (or forces) to apply to
the articulated body to minimize eq. (2) and (ii) an articulated constraints
enforcing step that finds a set of rigid motions $\delta_k^*$ that satisfy
articulated constraints while minimizing a Mahalanobis distance w.r.t. the rigid
motions $\delta_k$. The main steps of this tracking algorithm are recalled below.

ICP step. Given a set of 3D data and a 3D model of a rigid object to register,
ICP [1] estimates the motion transformation between the 3D model and the rigid
object. The ICP algorithm is applied to each limb $C_k$ independently, estimating
a motion transformation $\delta_k$ and its uncertainty $\Lambda_k$.
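
For reference, a generic ICP iteration for a single rigid limb can be sketched as below. This is the textbook form (brute-force nearest neighbours followed by an SVD-based rigid alignment), not the authors' implementation, and it does not estimate the uncertainty of the motion. All function names are assumptions.

```python
import numpy as np

def best_rigid_transform(P, Q):
    """Least-squares rotation R and translation t with R @ P[i] + t ~ Q[i]."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    d = np.linalg.det(Vt.T @ U.T)
    D = np.diag([1.0, 1.0, 1.0 if d > 0 else -1.0])   # guard against reflections
    R = Vt.T @ D @ U.T
    return R, cq - R @ cp

def icp(model, scene, iterations=10):
    """Align model points to scene points; returns the accumulated (R, t)."""
    R_total, t_total = np.eye(3), np.zeros(3)
    moved = model.copy()
    for _ in range(iterations):
        d2 = ((moved[:, None, :] - scene[None, :, :]) ** 2).sum(axis=2)
        matches = scene[np.argmin(d2, axis=1)]        # closest scene point per model point
        R, t = best_rigid_transform(moved, matches)
        moved = moved @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total

angle = 0.1
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.05, -0.02, 0.03])
model = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
scene = model @ R_true.T + t_true
R_est, t_est = icp(model, scene)
```

ICP only converges locally, which is why the tracker initializes it from the pose of the previous frame.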

Articulated constraints enforcing. The motion transformations $\delta_k$
correspond to 'directions' that minimize the distance between the limbs $C_k$ and
the reconstructed 3D points of the scene. However, taken together, the $\delta_k$
do not satisfy the articulated constraints (due to the spherical joints between
adjacent limbs).

Let $\Delta = (\delta_1, \ldots, \delta_N)^T$ be the (unconstrained) set of rigid
motions and $\Delta^* = (\delta_1^*, \ldots, \delta_N^*)^T$ be a set of rigid
motions satisfying the articulated constraints. A correct set of motion
transformations $\Delta^*$ that satisfies the spherical joint constraints can be
found by projecting the set of rigid motions $\delta_k$ onto the manifold defined
by articulated motions (see [5,6] for details). The projection is linear
(hypothesis of small angle rotations) and minimizes the following Mahalanobis
distance $\epsilon^2(\Delta^*)$:

$$\epsilon^2(\Delta^*) = \|\Delta^* - \Delta\|^2_{\Lambda} = (\Delta^* - \Delta)^T \Lambda^{-1} (\Delta^* - \Delta) \qquad (3)$$

where $\Lambda$ is the (block-diagonal) covariance matrix
$\Lambda = \mathrm{diag}(\Lambda_1, \Lambda_2, \ldots)$.

The projection is written $\Delta^* = P\,\Delta$, where $P$ is a projection matrix
whose entries are computed only from the covariance matrix $\Lambda$ and the
position of the spherical joints (before motion).
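
The structure of such a projection can be illustrated under one explicit assumption: that the articulated constraints have been linearized into a homogeneous system $C\,\Delta^* = 0$. The matrix $C$ is hypothetical here (the paper derives its constraints from the joint positions, see [5,6]). Under that assumption, minimizing the Mahalanobis distance of Eq. (3) gives the standard closed form $P = I - \Lambda C^T (C \Lambda C^T)^{-1} C$.

```python
import numpy as np

def constraint_projection(Lambda, C):
    """P such that P @ delta minimizes the Mahalanobis distance subject to C @ x = 0."""
    n = Lambda.shape[0]
    gain = Lambda @ C.T @ np.linalg.inv(C @ Lambda @ C.T)
    return np.eye(n) - gain @ C

def mahalanobis_sq(a, b, Lambda):
    """(a - b)^T Lambda^{-1} (a - b)."""
    d = a - b
    return float(d @ np.linalg.solve(Lambda, d))

Lambda = np.diag([1.0, 4.0])     # the second motion is less certain than the first
C = np.array([[1.0, -1.0]])      # toy constraint: both components must be equal
delta = np.array([0.0, 5.0])     # unconstrained estimate
P = constraint_projection(Lambda, C)
delta_star = P @ delta
```

The constrained solution lands closer to the more certain component, which is the effect of weighting by the inverse covariance.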

5 Tracking with Key Frames 

This section describes how model-based tracking and the view-based model are 
combined. 

At each new frame, articulated poses are estimated independently using the 
recursive (ICP-based) tracking algorithm and the view-based model. The correct 
pose is chosen so that it minimizes the fitting error function. Figure 3 illustrates 
the combined pose estimation algorithm. 




Fig. 3. Combined pose estimation. (Diagram labels: fusion, update.)



Let $\Pi_r$ be the pose estimated by applying the ICP-based tracking algorithm
(Section 4) to the pose found at the previous frame. Let $\Pi_v$ be the prediction
given by the view-based model (Section 3.1). $\Pi_v$ is found by:

- searching for the key frame $\mathcal{F}_k = \{J_k, x_k, L_k, \Pi_{0k}\}$ whose
view $J_k$ is most similar to the current image $I$;

- estimating the optical flow $dx$ of the support points $x_k$ between images
$J_k$ and $I$ and computing $\Pi_v = \Pi_{0k} + L_k\,dx$.

The fitting error function $E(\Pi)$ defined in (2) is evaluated at $\Pi_r$ and
$\Pi_v$. The pose corresponding to the smaller of $E(\Pi_r)$ and $E(\Pi_v)$ is
taken as the current pose:

$$\Pi = \arg\min_{\Pi' \in \{\Pi_r, \Pi_v\}} E(\Pi')$$
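
The fusion rule amounts to evaluating the fitting error at each candidate and keeping the best one. The sketch below uses a scalar toy pose and a stand-in error function; both are assumptions for illustration only.

```python
def select_pose(candidates, fitting_error):
    """Return the candidate pose with the smallest fitting error, and that error."""
    scored = [(fitting_error(pose), index) for index, pose in enumerate(candidates)]
    best_error, best_index = min(scored)
    return candidates[best_index], best_error

# Toy example: poses are scalars and the error is the distance to a pose of 3.0.
pose_r, pose_v = 7.5, 2.5    # drifted ICP-based estimate and view-based prediction
pose, error = select_pose([pose_r, pose_v], lambda p: abs(p - 3.0))
```

When the recursive tracker has drifted, the view-based prediction has the lower error and effectively re-initializes the tracker.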

The view-based model is built online using images $I$ (observed during the
tracking) and pose estimates $\Pi$. The next sections describe how new key frames
are added to the view-based model and detail the process for updating existing
key frames.

5.1 Key Frames Selection 

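The maximum number $N$ of key frames in the view-based model $\mathcal{M}$ is
obviously limited by the speed$^1$ and memory$^2$ of the CPU. Therefore the
choice of key frames to keep in the view-based model is crucial.

Many criteria can be considered to select the key frames (frames for which
the tracking is accurate, frames appearing frequently, ...). In this paper, we
prefer keeping the key frames that span a maximum of the appearance space. This
can be done by selecting key frames that maximize an intra-class distance
$\mathcal{D}(\mathcal{M})$ between key frames.

Let $\delta(\mathcal{F}, \mathcal{F}')$ be a distance between key frames $\mathcal{F}$
and $\mathcal{F}'$. The corresponding intra-class distance $\mathcal{D}(\mathcal{M})$
is defined as:

$$\mathcal{D}(\mathcal{M}) = \sum_{\{\mathcal{F}, \mathcal{F}'\} \subset \mathcal{M}} \delta(\mathcal{F}, \mathcal{F}')$$

Let $\mathcal{F}_k$ be a key frame from the view-based model $\mathcal{M}$ and
$\mathcal{F}_{new}$ be a new key frame. If $\mathcal{F}_{new}$ is such that:

$$\sum_{\mathcal{F} \in \mathcal{M},\, \mathcal{F} \neq \mathcal{F}_k} \delta(\mathcal{F}_{new}, \mathcal{F}) > \sum_{\mathcal{F} \in \mathcal{M},\, \mathcal{F} \neq \mathcal{F}_k} \delta(\mathcal{F}_k, \mathcal{F}) \qquad (4)$$

then the view-based model $\mathcal{M}_{new}$ obtained by replacing the key frame
$\mathcal{F}_k$ by $\mathcal{F}_{new}$ in the view-based model $\mathcal{M}$ satisfies
$\mathcal{D}(\mathcal{M}_{new}) > \mathcal{D}(\mathcal{M})$.

In practice, we keep a current estimate of the weakest key frame
$\mathcal{F}_{min} \in \mathcal{M}$ such that:

$$\mathcal{F}_{min} = \arg\min_{\mathcal{F} \in \mathcal{M}} \sum_{\mathcal{F}' \in \mathcal{M},\, \mathcal{F}' \neq \mathcal{F}} \delta(\mathcal{F}, \mathcal{F}')$$

When a new frame $\mathcal{F}_{new}$ satisfies (4) with
$\mathcal{F}_k = \mathcal{F}_{min}$, then $\mathcal{F}_{min}$ is replaced by
$\mathcal{F}_{new}$, therefore increasing the intra-class distance of $\mathcal{M}$.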
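
The replacement rule can be sketched on a precomputed matrix of pairwise key frame distances. The matrix layout, the function names, and the toy values are assumptions for the example; the paper does not specify how the distance between key frames is computed.

```python
import numpy as np

def weakest_index(dist):
    """Index of the key frame with the smallest summed distance to the others."""
    return int(np.argmin(dist.sum(axis=1)))      # the diagonal is zero

def try_replace(dist, k, dist_new):
    """Replace key frame k if the new frame satisfies criterion (4)."""
    others = np.arange(len(dist)) != k
    if dist_new[others].sum() > dist[k, others].sum():
        dist = dist.copy()
        dist[k, others] = dist_new[others]
        dist[others, k] = dist_new[others]
        return dist, True
    return dist, False

dist = np.array([[0.0, 1.0, 4.0],
                 [1.0, 0.0, 2.0],
                 [4.0, 2.0, 0.0]])           # symmetric pairwise distances
dist_new = np.array([3.0, 9.0, 5.0])          # new frame vs. each stored key frame
k = weakest_index(dist)
new_dist, replaced = try_replace(dist, k, dist_new)
intra_before = np.triu(dist).sum()            # sum over unordered pairs
intra_after = np.triu(new_dist).sum()
```

Keeping the row sums up to date makes the test for each incoming frame linear in the number of key frames.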

1 The pose prediction algorithm involves a comparison between the current image I
and the images of all key frames.

2 Because of real-time issues, frames cannot be stored on disk.

5.2 Key Frame Update

In this section, we show how the parameters $x$, $L$, $\Pi_0$ of a key frame
$\mathcal{F} = \{J, x, L, \Pi_0\}$ are estimated. Let $J_k$ (with $1 \le k \le N$)
be a set of images similar to $J$, and $\Pi_k$ the corresponding articulated
poses. Let $df^{(k)}$ be the motion of a feature point $f$ between the images $J$
and $J_k$.

Support points. First, the support points $x$ are estimated as the set of feature
points $f_i$ detected as being part of the articulated object to track. In our
current framework, support points $x$ are chosen so that they correspond to
pixels detected as foreground. In practice, we use the foreground weights
$w_{i,j}$ introduced in section 3.1. A pixel $(i, j)$ is considered as a support
point if its average foreground weight $w_{i,j}$ across the images $J_k$ is higher
than a threshold $\tau$.



Linear model: $L$, $\Pi_0$. Let $dx_k$ be the motion of the support points
$x = (f_1, f_2, \ldots)^T$ between the images $J$ and $J_k$. The matrix $L$ and
vector $\Pi_0$ are constrained by the linear equations (1) corresponding to the
observations $(\Pi_k, dx_k)$. If the number of images $J_k$ similar to $J$ is too
small, there are not enough constraints (1) to estimate $L$ and $\Pi_0$. In the
rest of this section, we assume that there are more constraints than entries in
$L$ and $\Pi_0$.

Solving eqs. (1) directly using a linear least-squares technique could lead to
biased estimates of $L$ and $\Pi_0$ because (i) the noise in the entries $\Pi_k$ is
not uniform and isotropic and (ii) the image motion of some of the support
points $x$ may be mis-estimated due, for instance, to the aperture problem or
the presence of similar textures. Therefore we propose a robust scheme to solve
for $L$ and $\Pi_0$ that accounts for the presence of outliers in $dx_k$.

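Eq. (1) can be rewritten:

$$dx_k = L^{-1}(\Pi_k - \Pi_0) = \Gamma\,\Pi_k + \mu \qquad (5)$$

with

$$\Gamma = L^{-1}, \qquad \mu = -L^{-1}\Pi_0 \qquad (6)$$

Let the matrices $\Gamma_i$ and vectors $\mu_i$ be such that
$\Gamma = (\Gamma_1^T \ldots \Gamma_N^T)^T$ and $\mu = (\mu_1^T \ldots \mu_N^T)^T$.

With $dx_k = (df_1^{(k)}, \ldots, df_N^{(k)})^T$ and considering only the lines of
(5) corresponding to the support point motion $df_i^{(k)}$, it gives:

$$df_i^{(k)} = \Gamma_i\,\Pi_k + \mu_i = \begin{pmatrix} \Gamma_i^{xT} \\ \Gamma_i^{yT} \end{pmatrix} \Pi_k + \mu_i = P_k\,q_i \qquad (7)$$

where

$$P_k = \begin{pmatrix} \Pi_k^T & 0 & 1 & 0 \\ 0 & \Pi_k^T & 0 & 1 \end{pmatrix}$$

and $q_i = (\Gamma_i^{xT}, \Gamma_i^{yT}, \mu_i^x, \mu_i^y)^T$ stacks the entries of
$\Gamma_i$ and $\mu_i$.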

Vector $q_i$ is found by solving simultaneously eqs. (7) for all $k$ using a
robust optimization technique based on an M-estimator [7]. More precisely, we
introduce an influence function $\rho(x, \sigma) = \log(1 + \frac{x^2}{\sigma^2})$
and minimize the following objective function:

$$\sum_k \rho(\|df_i^{(k)} - P_k\,q_i\|, \sigma) \qquad (8)$$

The scalar $\sigma$ corresponds to the expected covariance of the noise in the
inliers (in our implementation, $\sigma = 2.0$ pix). It is worth noting that
eq. (8) is actually solved using an iterative weighted linear least-squares
method (see [7] for details). Once the vectors $q_i$ are estimated, $L$ and
$\Pi_0$ are estimated using (6).
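
The robust estimation of one vector $q_i$ can be sketched with iteratively reweighted least squares. The sketch assumes the influence function has the form $\log(1 + x^2/\sigma^2)$, for which the IRLS weight is proportional to $1/(1 + r^2/\sigma^2)$; the layout of `design_matrix`, the synthetic data, and the outlier pattern are assumptions for the example, not the authors' code.

```python
import numpy as np

def design_matrix(pose):
    """P_k of Eq. (7): maps q_i = (Gamma_x, Gamma_y, mu_x, mu_y) to a 2D point motion."""
    p = len(pose)
    P = np.zeros((2, 2 * p + 2))
    P[0, :p] = pose
    P[1, p:2 * p] = pose
    P[0, 2 * p] = 1.0
    P[1, 2 * p + 1] = 1.0
    return P

def robust_fit(poses, motions, sigma=2.0, iterations=20):
    """IRLS for sum_k rho(||df_k - P_k q||, sigma), rho(x, s) = log(1 + x^2 / s^2)."""
    Ps = np.stack([design_matrix(pose) for pose in poses])    # (K, 2, 2p+2)
    weights = np.ones(len(poses))                             # first pass = plain least squares
    for _ in range(iterations):
        A = np.einsum('k,kij,kil->jl', weights, Ps, Ps)
        b = np.einsum('k,kij,ki->j', weights, Ps, motions)
        q = np.linalg.solve(A, b)
        residuals = np.linalg.norm(motions - Ps @ q, axis=1)
        weights = 1.0 / (1.0 + (residuals / sigma) ** 2)      # proportional to rho'(r) / r
    return q

rng = np.random.default_rng(0)
poses = rng.normal(size=(40, 2))                       # 40 observed poses, 2 parameters each
q_true = np.array([1.0, -2.0, 0.5, 3.0, 4.0, -1.0])
motions = np.stack([design_matrix(p) @ q_true for p in poses])
motions[:4] += np.array([50.0, -50.0])                 # four grossly wrong flow vectors
q_ls = robust_fit(poses, motions, iterations=1)        # ordinary least squares
q_robust = robust_fit(poses, motions)
```

With a single iteration the function returns the ordinary least-squares solution, which makes the effect of the reweighting easy to compare on the same data.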

5.3 Summary 

The complete tracking algorithm can be summarized as follows:

- Key frame search. The key frame $\mathcal{F}_k = \{J_k, x_k, L_k, \Pi_{0k}\}$ of
the view-based model whose image is the closest to the current image $I$ is
found;

- Pose estimation. Pose $\Pi_v$ is predicted using the linear model (1) and the
optical flow $dx$ between images $I$ and $J_k$. Pose $\Pi_r$ is estimated using the
ICP-based algorithm. The pose minimizing the fitting error function (2) is
chosen as the correct pose $\Pi$;

- View-based model update. The optical flow $dx$ is added as an additional
constraint to update the linear model $(L_k, \Pi_{0k})$ of key frame
$\mathcal{F}_k$. If image $I$ satisfies criterion (4), then a new key frame
$\mathcal{F}_{new}$ is created (with image $I$ and pose $\Pi$).

6 Experiments 

We applied the body tracking approach described previously to stereo image
sequences captured in our lab. Experiments were done in order to compare
the standard recursive (ICP-based) algorithm with our approach (ICP-based
combined with a view-based model). The algorithms were run on a Pentium 4
(2 GHz). The ICP-based algorithm alone runs at a speed ranging from 8 Hz to
12 Hz. The ICP-based algorithm combined with a view-based model runs at
about 5 Hz. In these experiments, the maximum number of key frames in the
view-based model is $N = 100$.

In order to learn the view-based model, a training sequence of about 2000
images is used. The training sequence is similar to the one shown in Figure 4
(same background and subject).

Figure 4 shows some comparative results on a testing sequence of more than
1500 images. More exactly, the figure shows the corresponding images of the
sequence and the re-projection of the 3D articulated model for frames 132, 206,
339, 515, 732 and 850. Results show that our approach is able to re-initialize
after tracking failure.

Figure 5 shows the average error between the estimation of the 3D model
and the 3D scene reconstruction from the stereo camera for the two algorithms.
Additional sequences can be found at: http://www.ai.mit.edu/~demirdji.

Fig. 4. Comparative results (re-projection of the 3D articulated model) on a sequence
of more than 1500 images (lines correspond to frames 132, 206, 339, 515, 732 and
850). The graph shows that, with our approach (ICP + view-based model), the error
is always smaller. The left column corresponds to the ICP-based tracking algorithm.
The right column corresponds to our algorithm (ICP + view-based model).

Fig. 5. Average error between the estimation of the 3D articulated model and the 3D
scene reconstruction vs. number of frames. Peaks in the data (around frames 520, 670,
930, 1100, 1190) corresponding to the ICP algorithm are actually tracking failures.



7 Conclusion 

We described an approach for real-time articulated body tracking. The approach 
combines traditional recursive vision-based tracking and a view-based model to 
estimate the pose of an articulated object. We introduce an appearance model 
that contains views (or key frames) of a person under various articulated poses. 
The appearance model is built and updated online. The main contribution con- 
sists in modeling, in each frame, the pose change as a linear transformation of 
the view change. 

The experiments we carried out show that our approach significantly in- 
creases the robustness of the tracking by enabling an automatic re-initialization 
in case of failure of the traditional recursive tracking algorithm. Experiments 
are being carried out to show the accuracy of the linear predictor of the view- 
based model. The use of an online background learning algorithm allows our 
approach to be robust to slowly varying background. However, our approach is
not robust to changes in clothing or subject. In future work, we plan to extend
our approach by introducing an adaptive appearance model to model the
variability of appearance across people and clothes.

Abstract. We present an algorithm for shape matching and recognition
based on a generative model for how one shape can be generated by
the other. This generative model allows for a class of transformations,
such as affine and non-rigid transformations, and induces a similarity
measure between shapes. The matching process is formulated in the EM
algorithm. To have a fast algorithm and avoid local minima, we show how
the EM algorithm can be approximated by using informative features,
which have two key properties: they are invariant and representative. They
are also similar to the proposal probabilities used in DDMCMC [13]. The
formulation allows us to know when and why approximations can be
made and justifies the use of bottom-up features, which are used in a wide
range of vision problems. This integrates generative models and
feature-based approaches within the EM framework and helps clarify the
relationships between different algorithms for this problem such as shape
contexts [3] and softassign [5]. We test the algorithm on a variety of data
sets including MPEG7 CE-Shape-1, Kimia silhouettes, and real images of
street scenes. We demonstrate very effective performance and compare
our results with existing algorithms. Finally, we briefly illustrate how
our approach can be generalized to a wider range of problems including
object detection.



1 Introduction 

Shape matching has been a long standing problem in computer vision and it 
is fundamental for many tasks such as image compression, image segmentation, 
object recognition, image retrieval, and motion tracking. A great deal of effort 
has been made to tackle this problem and numerous matching criteria and algo- 
rithms have been proposed. For example, some typical criteria include Fourier 
analysis, moments analysis, scale space analysis, and the Hausdorff distance. For 
details of these methods see a recent survey paper [14] . 

The two methods most related to this paper are shape contexts [3] and
softassign [5]. The shape contexts method is a feature-based algorithm which has
demonstrated its ability to match certain types of shapes in a variety of
applications. The softassign approach [5] formulates shape registration/matching
as a free energy minimization problem using the mean field approximation. Recent
improvements to these methods include the use of dynamic programming to
improve shape contexts [12] and the Bethe-Kikuchi free energy approximation [9],
which improves on the mean field theory approximation used in softassign [5].

Our work builds on shape contexts [3] and softassign [5] to design a fast and
effective algorithm for shape matching. Our approach is also influenced by ideas
from the Data-Driven Markov Chain Monte Carlo (DDMCMC) paradigm [13],
which is a general inference framework. It uses data-driven proposals to activate
generative models and thereby guide a Markov Chain to rapid convergence.

First, we formulate the problem as Bayesian inference using generative mod- 
els allowing for a class of shape transformations, see section (2). In section (3), 
we relate this to the free energy function for the EM algorithm [8] and, thereby, 
establish a connection to the free energy function used in softassign [5] . 

Secondly, we define a set of informative features, which observe two key
properties: they are invariant (or semi-invariant) to shape transformations such
as scaling, rotation, and certain non-rigid transformations, and they are
representative, see sections (4.1, 4.2). Shape contexts [3] are examples of
informative features.

Thirdly, the generative model and informative features are combined in the
EM free energy framework, see sections (4.3, 4.4). The informative features are
used as approximations, similar to the proposals in DDMCMC [13], which guide
the algorithm to activate the generative models and achieve rapid convergence.
Alternatively, one can think of the informative features as providing
approximations to the true probability distributions, similar to the mean field
and Bethe-Kikuchi approximations used by Rangarajan et al. [5], [9].

We tested our algorithm on a variety of binary and real images and obtained
very good performance, see section (6). The algorithm was extensively tested on
binary datasets where its performance could be compared to existing algorithms.
But we also give results on real images for recognition and detection.

2 Problem Definition 

2.1 Shape Representation 

The task of shape matching is to match two arbitrary shapes, $X$ and $Y$, and to
measure the similarity (metric) between them. Following Grenander's pattern
theory [6], we can define shape similarity in terms of the transformation $F$ that
takes one shape to the other, see Fig. 1. In this paper we allow two types of
transformation: (i) a global affine transformation, and (ii) a local small and
smooth non-rigid transformation.

We assume that each shape is represented by a set of points which are either 
sparse or connected (the choice will depend on the form of the input data). 

For the sparse point representation, we denote the target and source
shape respectively by:

$$X = \{x_i : i = 1, \ldots, M\}, \text{ and } Y = \{y_a : a = 1, \ldots, N\}.$$

Fig. 1. Illustration of a shape matching case in which a source shape Y is matched
with a target shape X through a transformation function F: (a) target shape X,
(b) transformation F, (c) source shape Y.

This representation will be used if we match a shape to the edge map of an 
image. 

For the connected point representation, we denote the target and source
shape respectively by:

$$X = \{x(s) : s \in [0, 1]\}, \text{ and } Y = \{y(t) : t \in [0, 1]\},$$

where $s$ and $t$ are normalized arc-length distances. This model is used for
matching shapes to silhouettes. (The extension to multiple contours is
straightforward.)

2.2 The Probability Models

We assume a shape $X$ is generated by a shape $Y$ by a transformation
$F = (A, f)$, where $A$ is an affine transformation and $f$ denotes a non-rigid
local transformation (in thin-plate-splines (TPS) [4], the two transformations
are combined, but we separate them here for clarity). For any point $y_a$ on $Y$,
let $v_a \in \{0, \ldots, M\}$ be the correspondence variable to points in $X$. For
example, $v_a = 4$ means that point $y_a$ on $Y$ corresponds to point $x_4$ on $X$.
If $v_a = 0$, then $y_a$ is unmatched. We define $V = (v_a, a = 1, \ldots, N)$. The
generative model is written as

$$p(X \mid Y, V, (A, f)) \propto \exp\{-E_D(X, Y, V, (A, f))\},$$

where

$$E_D(X, Y, V, (A, f)) = \sum_a (1 - \delta(v_a))\,\|x_{v_a} - A y_a - f(y_a)\|^2 / \sigma^2, \qquad (1)$$

and $(1 - \delta(v_a))$ is used to discount unmatched points (where $v_a = 0$).
There is a prior probability $p(V)$ on the matches which pays a penalty for
unmatched points. Therefore,

$$p(X, V \mid Y, (A, f)) \propto \exp\{-E_T(X, Y, V, (A, f))\}$$

where $E_T(X, Y, V, (A, f)) = E_D(X, Y, V, (A, f)) - \log p(V)$.
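
The data term of Eq. (1) can be evaluated directly once the correspondences are given. In the sketch below the function name, the 1-based encoding of the correspondence variables, and the toy shapes are assumptions for illustration.

```python
import numpy as np

def data_energy(X, Y, V, A, f, sigma=1.0):
    """E_D of Eq. (1): squared residuals of matched points; v_a = 0 means unmatched."""
    energy = 0.0
    for y_a, v_a in zip(Y, V):
        if v_a == 0:            # (1 - delta(v_a)) = 0: unmatched points are discounted
            continue
        residual = X[v_a - 1] - A @ y_a - f(y_a)   # v_a is 1-based, as in the text
        energy += float(residual @ residual) / sigma ** 2
    return energy

X = np.array([[2.0, 0.0], [0.0, 2.0]])                  # target shape
Y = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])      # source shape; last point is clutter
A = 2.0 * np.eye(2)                                     # pure scaling by 2
no_warp = lambda y: np.zeros(2)                         # no non-rigid deformation
energy_exact = data_energy(X, Y, [1, 2, 0], A, no_warp)
energy_swapped = data_energy(X, Y, [2, 1, 0], A, no_warp)
```

Leaving the clutter point unmatched costs nothing in the data term itself; the penalty for doing so comes from the prior on the matches.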

The affine transformation $A$ is decomposed [1] into scaling, rotation, and
shearing components, with rotation component

$$\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

where $\theta$ is the rotation angle, $S_x$ and $S_y$ denote scaling, and $k$ is
shearing. The prior on $A$ is given by $p(A) \propto \exp\{-E_A(A)\}$, where
$E_A(A) = E_{rotation}(\theta) + E_{scaling}(S_x, S_y) + E_{shearing}(k)$.

The prior on the non-rigid transformation $f$ is given by

$$p(f) \propto \exp\{-E_f(f)\}, \text{ and } E_f(f) = \lambda \int \sum_{m=0}^{\infty} c_m (D^m f)^2\,dy.$$

The $\{c_m\}$ are set to be $\sigma^{2m}/(m!\,2^m)$ (Yuille and Grzywacz [15]). This
enforces a probabilistic bias for the transformations to be small (the
$c_0 = 1$ term) and smooth (the remaining terms $\{c_i : i \ge 1\}$). It can be
shown [15] that $f$ is of the form $f(x) = \sum_i \alpha_i G(x - x_i)$, where
$G(x)$ is the Green's function of the differential operator. We use the Gaussian
kernel for $f$ in this paper (alternative kernels such as TPS give similar
results).
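
A deformation of this kernel form can be sketched as follows. The unnormalized Gaussian, the choice of kernel centers, and the coefficient values are assumptions for illustration; the exact normalization of the Green's function is not specified in the text.

```python
import numpy as np

def gaussian_warp(x, centers, alphas, sigma=1.0):
    """f(x) = sum_i alpha_i G(x - x_i) with an unnormalized Gaussian kernel G."""
    sq_dist = np.sum((centers - x) ** 2, axis=1)
    kernel = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return kernel @ alphas

centers = np.array([[0.0, 0.0], [10.0, 0.0]])    # kernel centers x_i
alphas = np.array([[1.0, 0.0], [0.0, -1.0]])     # one 2D coefficient per center
at_center = gaussian_warp(np.array([0.0, 0.0]), centers, alphas)
far_away = gaussian_warp(np.array([0.0, 100.0]), centers, alphas)
```

Each coefficient only displaces points near its own center, which is what makes the deformation local and smooth.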

The generative model and the prior probabilities determine a similarity
measure:

$$D(X \| Y) = -\log p(X \mid Y) = -\log \int \sum_V p(X, V, (A, f) \mid Y)\,dA\,df. \qquad (2)$$

Unfortunately, evaluating eqn. (2) requires integrating out $(A, f)$ and summing
out $V$. Both stages are computationally very expensive. Our strategy is
to approximate the sum over $V$ by using the informative features described
in section (4). We then approximate the integral over $(A, f)$ by the modes of
$p(A, f \mid X, Y)$ (similar to a saddle point approximation). Therefore we seek to
find the $(A, f)^*$ that best represent the distribution:

$$\int \sum_V p(X, V, (A, f) \mid Y)\,dA\,df \sim \mathrm{Par}\Big(\sum_V p(X, V, (A, f)^* \mid Y)\Big) \qquad (3)$$



where Par is a Parzen window. Our experiments show that the integral is almost
always dominated by $(A, f)^*$. Therefore, we approximate the similarity measure
by:

$$D_{Approx}(X \| Y) = -\log \sum_V p(X, V, (A, f)^* \mid Y) \qquad (4)$$

where

$$(A, f)^* = \arg\max_{(A, f)} \sum_V p(X, V, (A, f) \mid Y) = \arg\min_{(A, f)} -\log \sum_V p(X, V \mid Y, (A, f))\,p(A)\,p(f). \qquad (5)$$

In rare cases, we will require the sum over several modes. For example, three
modes $((A, f)^*_1, (A, f)^*_2, (A, f)^*_3)$ are required when matching two
equilateral triangles, see Fig. (2).

Note that this similarity measure is not symmetric between $X$ and $Y$. But
in practice, we found that it was approximately symmetric unless one shape was
significantly larger than the other (because of how $A$ scales the measure).
To avoid this problem, we can compute $D(X \| Y) + D(Y \| X)$. The recognition
aspect of the algorithm can be naturally extended from the similarity measure
for the two shapes.



3 The EM Free Energy 

Computing $(A, f)^*$ in equation (5) requires us to sum out the hidden variable
$V$. This fits the framework of the EM algorithm. It can be shown [8] that
estimating $(A, f)^*$ in eqn. (5) is equivalent to minimizing the EM free energy
function:

$$E(p, (A, f)) = -\sum_V p(V) \log p(X, V \mid Y, (A, f)) - \log p(A, f) + \sum_V p(V) \log p(V)$$

$$= \sum_V p(V)\,E_T(X, Y, V, (A, f)) + E_A(A) + E_f(f) + \sum_V p(V) \log p(V). \qquad (6)$$

The EM free energy is minimized when $p(V) = p(V \mid X, Y, A, f)$. The EM
algorithm consists of two steps: (I) the E-step minimizes $E(p, (A, f))$ with
respect to $p(V)$ keeping $(A, f)$ fixed; (II) the M-step minimizes
$E(p, (A, f))$ with respect to $(A, f)$ with $p(V)$ fixed. But an advantage of the
EM free energy is that any algorithm which decreases the free energy is
guaranteed to converge to, at worst, a local minimum [8]. Therefore we do not
need to restrict ourselves to the standard E-step and M-step.
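
The alternation of the two steps can be sketched for a simplified version of the model. The sketch makes several assumptions that are not in the paper: the correspondences are treated as independent across source points, unmatched points receive a constant likelihood, the non-rigid term and the priors on the transformation are dropped, and an explicit translation $t$ is added to the affine map.

```python
import numpy as np

def e_step(X, Y, A, t, sigma, outlier=1e-3):
    """Posterior p(v_a = i); column 0 is 'unmatched', columns 1..M are the points of X."""
    pred = Y @ A.T + t
    d2 = ((pred[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)    # (N, M)
    lik = np.hstack([np.full((len(Y), 1), outlier), np.exp(-d2 / sigma ** 2)])
    return lik / lik.sum(axis=1, keepdims=True)

def m_step(X, Y, post):
    """Weighted least squares for (A, t) given the soft correspondences."""
    W = post[:, 1:]                                   # drop the unmatched column
    Yh = np.hstack([Y, np.ones((len(Y), 1))])         # homogeneous source points
    lhs = Yh.T @ (W.sum(axis=1)[:, None] * Yh)        # (3, 3) normal equations
    rhs = Yh.T @ (W @ X)                              # (3, 2)
    params = np.linalg.solve(lhs, rhs)                # rows: A^T, then t
    return params[:2].T, params[2]

Y = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0], [2.0, 1.0]])
A_true = np.array([[1.05, -0.05], [0.05, 1.05]])
t_true = np.array([0.2, -0.1])
X = Y @ A_true.T + t_true

A_est, t_est = np.eye(2), np.zeros(2)                 # start from the identity
for _ in range(10):
    post = e_step(X, Y, A_est, t_est, sigma=0.5)
    A_est, t_est = m_step(X, Y, post)
```

The example starts close to the true transformation; from a poor initial state the same iteration can settle in a local minimum, which is the motivation for a good initialization.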

Chui and Rangarajan's free energy [5], Eq. (7), can be obtained as a mean field
approximation to the EM free energy. This requires assuming that $p(V)$ can be
approximated by a factorizable distribution $\prod_a p(v_a)$. The soft-assign
variables $m_{ai} \in [0, 1]$ are related to $p(V)$ by $m_{ai} = p(v_a = i)$. An
alternative approximation to the EM free energy can be made by using the
Bethe-Kikuchi free energy [9].

Like Rangarajan et al. [5,9], we will need to approximate $p(V)$ in order to
make the EM algorithm tractable. Our approximations will be motivated by
informative features, see section (4), which will give a link to shape contexts [3]
and feature-based algorithms.



4 Implementing the EM Algorithm 

In this section, we introduce informative features and describe the
implementation of the algorithm.

Fig. 2. The distribution $p(\theta \mid X, Y)$, shown in (f), has three modes for a
target shape X, shown in (a), and a source shape Y, shown in (b). (c), (d), and (e)
respectively display the three possible values for $\theta$.



4.1 Computing the Initial State 

The EM algorithm is only guaranteed to converge to a local minimum of the
free energy. Thus, it is critical for the EM algorithm to start with the "right"
initial state. Our preliminary experiments in shape matching suggested that the
probability distribution for $(A, f)$ is strongly peaked and the probability mass
is concentrated in small areas around $\{(A, f)^*, (A, f)^*_2, (A, f)^*_3, \ldots\}$.
Hence if we can make good initial estimates of $(A, f)$, then EM has a good chance
of converging to the global optimum.

The rotation angle $\theta$ is usually the most important part of $(A, f)$ to be
estimated. (See Fig. 2 for an example where there are three equally likely
choices for $\theta$.) It would be best to get the initial estimate of $\theta$ from
$p(\theta \mid X, Y)$, but this requires integrating out variables, which is
computationally too expensive. Instead, we seek to approximate
$p(\theta \mid X, Y)$ (similar to the Hough Transform [2]) by an informative feature
distribution $p_{IF}(\theta \mid X, Y)$:

$$p(\theta \mid X, Y) \approx p_{IF}(\theta \mid X, Y) = \sum_i \sum_a q(\phi(x_i), \phi(y_a))\,\delta(\theta - \theta(a, i, X, Y)), \qquad (8)$$

where $\phi(x_i)$ and $\phi(y_a)$ are informative features for points $x_i$ and
$y_a$ respectively, $q(\phi(x_i), \phi(y_a))$ is a similarity measure between the
features, and $\theta(a, i, X, Y)$ is the angle if the $i$th point on $X$ is matched
with the $a$th point on $Y$.
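
The voting scheme of Eq. (8) can be sketched as a weighted histogram over rotation angles. The sketch assumes that the rotation implied by a candidate match is the difference between the tangent angles of the two points; the text does not spell out how this angle is computed, so this choice, the bin count, and the toy similarity values are illustrative.

```python
import math

def rotation_histogram(angles_x, angles_y, similarity, bins=36):
    """Accumulate q(i, a) at the rotation implied by matching point i on X with a on Y."""
    hist = [0.0] * bins
    width = 2.0 * math.pi / bins
    for i, ax in enumerate(angles_x):
        for a, ay in enumerate(angles_y):
            theta = (ax - ay) % (2.0 * math.pi)      # assumed form of theta(a, i, X, Y)
            hist[int(theta / width) % bins] += similarity[i][a]
    total = sum(hist)
    return [h / total for h in hist]

angles_y = [0.05, 1.05, 2.05]                        # tangent angles on the source shape
angles_x = [a + 0.5 for a in angles_y]               # same shape rotated by 0.5 rad
similarity = [[1.0 if i == a else 0.1 for a in range(3)] for i in range(3)]
hist = rotation_histogram(angles_x, angles_y, similarity)
peak = max(range(len(hist)), key=hist.__getitem__)
```

Several well-separated peaks in the histogram correspond to the multi-modal case, where each peak is a candidate initial state for EM.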

Next, we describe how to design the informative features $\phi(x_i)$ and the
similarity measures $q(\phi(x_i), \phi(y_a))$.

4.2 Designing the Informative Features 

The informative features are used to make computationally feasible
approximations to the true probability distributions. They should observe two
key properties so that

$$\int p(\theta \mid X, Y, (A_{-\theta}, f))\,p(A_{-\theta})\,p(f)\,dA_{-\theta}\,df \approx p(\theta \mid \phi(X), \phi(Y)).$$

(I) They should be as "invariant" as possible to the transformations. Ideally
$p(\theta \mid \phi(X), \phi(Y), (A_{-\theta}, f)) = p(\theta \mid \phi(X), \phi(Y))$.

Fig. 3. Features and the similarity measure on the features. (a) illustrates how
the local and global features are measured for connected points. In (b), the
features of two points in shape X and Y are displayed. The top figure in the
middle of (b) shows similarities between point a in Y w.r.t. all points in X using
the shape context feature. The other two figures in the middle of (b) are the
similarities between points a and b in Y w.r.t. all points in X respectively. As
we can see, similarities by features defined in this paper for connected points
have lower entropy than those by shape contexts.



(II) They should be "representative". For example, we would ideally have

$$\int p(\theta \mid X, Y, (A_{-\theta}, f))\,p(A_{-\theta})\,p(f)\,dA_{-\theta}\,df = \int p(\theta \mid \phi(X), \phi(Y), (A_{-\theta}, f))\,p(A_{-\theta})\,p(f)\,dA_{-\theta}\,df$$

where $A_{-\theta}$ denotes the components of $A$ except for $\theta$, and $\phi(X)$,
$\phi(Y)$ are the feature vectors for all points in both images.

The two properties of informative features are also used to approximate the
distributions of other variables, for example $p(V \mid X, Y)$, which requires
us to integrate out $(A, f)$ and can be approximated by
$p_{IF}(V \mid \phi(X), \phi(Y)) = \prod_a q(\phi(x_{v_a}), \phi(y_a))$.

In this paper we select the features $\phi(\cdot)$ and measures $q(\cdot, \cdot)$ so
as to obtain $p_{IF}$ with low entropy. This is a natural choice because it implies
that the features have low matching ambiguity. We can evaluate this low-entropy
criterion over our dataset for different choices of features and measures, see
figure (3).
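
The low-entropy criterion can be illustrated by normalizing the similarity scores of one source point against all target points and computing the entropy of the result. The function name and the example scores are assumptions for illustration.

```python
import math

def matching_entropy(similarities):
    """Entropy (in nats) of the distribution obtained by normalizing similarity scores."""
    total = sum(similarities)
    probs = [s / total for s in similarities if s > 0]
    return -sum(p * math.log(p) for p in probs)

peaked = matching_entropy([0.97, 0.01, 0.01, 0.01])   # one clear match
flat = matching_entropy([0.25, 0.25, 0.25, 0.25])     # maximally ambiguous
```

A feature whose similarity profile is peaked leaves little ambiguity about the match, while a flat profile carries no information about the correspondence.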

A better criterion, though harder to implement, is to select the features and
measures which minimize the conditional Kullback-Leibler divergence evaluated
over the distribution $p(X, Y)$ of problem instances:

$$\sum_{X, Y} p(X, Y) \sum_{V, (A, f)} p(V, (A, f) \mid X, Y) \log \frac{p(V, (A, f) \mid X, Y)}{p_{IF}(V, (A, f) \mid \phi(X), \phi(Y))} \qquad (9)$$

but full evaluation of this criterion is a task for future work.

We used the low-entropy criterion to devise different sets of features for the 
two cases, shapes of connected points representation and shapes of sparse points 
representation. 



Case I: The Connected Point Representation 

We use local and global features illustrated by Fig. 3. The local features at a
point $x(s_i)$ with tangent $\psi_i$ are defined as follows. Choose six points on
the curve by $(x(s_i - 3ds), x(s_i - 2ds), x(s_i - ds), x(s_i + ds), x(s_i + 2ds),
x(s_i + 3ds))$, where $ds$ is a (small) constant. The angles of these positions
w.r.t. the point $x(s_i)$ are $(\psi_i + \omega_j, j = 1..6)$. The local features
are $h_l(x_i) = (\omega_j, j = 1..6)$. The global features are selected in a
similar way. We choose six points near $x(s_i)$, with tangent $\psi_i$, to be
$(x(s_i - 3\Delta s), x(s_i - 2\Delta s), x(s_i - \Delta s), x(s_i + \Delta s),
x(s_i + 2\Delta s), x(s_i + 3\Delta s))$, where $\Delta s$ is a (large) constant,
with angles $\psi_i + \varphi_j : j = 1, \ldots, 6$. The global features are
$h_g(x_i) = (\varphi_j, j = 1..6)$. Observe that the features
$\phi = (h_l, h_g)$ are invariant to rotations in the image plane and also, to
some extent, to local transformations.

In Fig. 3(b), for display purposes we plot sinusoids $(\sin(h_l), \sin(h_g))$ for
two points on $X$ and two points on $Y$. Observe the similarity between
these features on the corresponding points.

The similarity measure between the two points is defined to be: 

$$q_c(\phi(x_i), \phi(y_a)) = 1 - c_1 \Big( \sum_{j=1}^{6} D_{angle}(\omega_j(x_i) - \omega_j(y_a)) + \sum_{j=1}^{6} D_{angle}(\varphi_j(x_i) - \varphi_j(y_a)) \Big),$$

where $D_{angle}(\omega_j(x_i) - \omega_j(y_a))$ is the minimal angle from $\omega_j(x_i)$ to $\omega_j(y_a)$, and 
$c_1$ is a normalization constant. The second and the third rows in the middle of 
Fig. 3(b) respectively plot the vector $q_c(y) = [q_c(\phi(x_i), \phi(y)), i = 1..M]$ as a 
function of $i$ for two points on Y. 
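The angular similarity measure above can be sketched in a few lines of Python. This is only an illustration: the function names `d_angle` and `q_c` and the default normalization `c1 = 1 / (12 * pi)` are assumptions made here so that the result falls in [0, 1]; the paper does not state the value of the constant.

```python
import math

def d_angle(a, b):
    """Minimal absolute angle (radians) between directions a and b, in [0, pi]."""
    d = (a - b) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d)

def q_c(feat_x, feat_y, c1=None):
    """Similarity 1 - c1 * (sum of minimal angle differences).

    feat_x, feat_y: pairs (local_angles, global_angles), six angles each.
    c1: normalization constant; the default 1 / (12 * pi) is an assumption.
    """
    (local_x, global_x), (local_y, global_y) = feat_x, feat_y
    if c1 is None:
        c1 = 1.0 / ((len(local_x) + len(global_x)) * math.pi)
    total = sum(d_angle(a, b) for a, b in zip(local_x, local_y))
    total += sum(d_angle(a, b) for a, b in zip(global_x, global_y))
    return 1.0 - c1 * total
```

Because only differences of relative angles enter the measure, rotating both shapes by the same amount leaves the value unchanged.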

Case II: The Sparse Point Representation 

In this case, we also use local and global features. To obtain the local feature 
for point $x_i$, we draw a circle with a (small) radius $r$ and collect all the points 
that fall into the circle. The relative angles of these points w.r.t. $x_i$ and $x_i$'s 
tangent angle are computed. The histogram of these angles is then used as the 
local feature, $H_l$. 

The global feature for the sparse points is computed by shape contexts [3]. 
We denote it by $H_g$ and the features become $\phi = (H_l, H_g)$. 

The feature similarity between two points $x_i$ and $y_a$ is measured by the $\chi^2$ 
distance: 

$$q_s(\phi(x_i), \phi(y_a)) = 1 - c_2 (\chi^2(H_l(x_i), H_l(y_a)) + \chi^2(H_g(x_i), H_g(y_a))).$$

The first row in the middle of Fig. 3(b) plots the vector 

$$q_s(y_a) = [q_s(\phi(x_i), \phi(y_a)), i = 1..M]$$

as a function of $i$ for a point $y_a$ on Y. 
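A minimal sketch of the histogram-based similarity follows. It assumes the symmetric form of the chi-square distance commonly used with shape contexts, 0.5 * sum((a - b)^2 / (a + b)), and normalized histograms; both the exact form and the default `c2 = 0.5` are assumptions, since the text does not specify them.

```python
def chi2(h1, h2):
    """Symmetric chi-square distance between two histograms (assumed form)."""
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:  # skip empty bin pairs to avoid division by zero
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total

def q_s(local_x, global_x, local_y, global_y, c2=0.5):
    """Similarity 1 - c2 * (chi2 of local histograms + chi2 of global histograms)."""
    return 1.0 - c2 * (chi2(local_x, local_y) + chi2(global_x, global_y))
```

With histograms normalized to sum to one, each chi-square term lies in [0, 1], so the assumed `c2 = 0.5` keeps the similarity in [0, 1].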
The advantage of the sparse point representation is that it is very general 
and does not require a procedure to group points into contours. But for this very 
reason, the features and measures have higher entropy than those for the 
connected point representation. In particular, the global nature of the shape context 
features [3] means that these features and measures tend to have high entropy, 
see Fig. 3(b). Shape context features are also of unnecessarily 
high dimension, consisting of 2D histograms with 60 bins, and better results, 
in terms of entropy, can be obtained with lower dimensional features. 



4.3 The E Step: Approximating p(V ) 

We can obtain an approximation $p_{IF}(\theta|X,Y)$ to $p(\theta|X,Y)$, see equation (8), 
using the informative features and similarity measures described in the previous 
section. We select each peak in $p_{IF}(\theta|X,Y)$ as an initial condition $\theta_{initial}$ for 
$\theta$. The same approach is used to estimate the other variables in $A$ and $f$ from 
$p(V|X,Y,\theta_{initial})$. We use informative features similar to those described in the 
previous section, except that we replace $\psi_i$ by $\theta_{initial}$: 

$$h'_l = (\omega'_j, j = 1..6) = (\omega_j + \psi_i - \theta_{initial}, j = 1..6),$$

and 

$$h'_g = (\varphi'_j, j = 1..6) = (\varphi_j + \psi_i - \theta_{initial}, j = 1..6). \qquad (10)$$

We also augment the similarity measure by including the scaled relative 
position of each point to the center of the shape, $\bar{x} = \frac{1}{M} \sum_i x_i$: 

$$q'_c(\phi(x_i), \phi(y_a)) = 1 - c_1 \sum_{j=1}^{6} \big[ D_{angle}(\omega'_j(x_i) - \omega'_j(y_a)) + D_{angle}(\varphi'_j(x_i) - \varphi'_j(y_a)) \big] - c_2 \|(x_i - \bar{x}) - (y_a - \bar{y})\|^2.$$



Thus, we have the following approximation: 

$$p(V|X,Y,\theta) \approx p_{IF}(V|X,Y,\theta) = \prod_a p_{if}(v_a | y_a, X, Y, \theta), \qquad (11)$$

where 

$$p_{if}(v_a = i | y_a, X, Y, \theta) = \frac{q'_c(\phi(x_i), \phi(y_a))}{\sum_{j=0}^{M} q'_c(\phi(x_j), \phi(y_a))}.$$
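The normalization in eqn. (11) turns one row of similarity values into a distribution over candidate matches. A minimal sketch, assuming the similarities for a fixed source point are already available as a list; the function name `p_if`, the clipping of negative similarities to zero, and the uniform fallback are illustrative assumptions, not part of the paper.

```python
def p_if(similarities):
    """Normalize similarities q'(x_j, y_a), j = 0..M, into match probabilities.

    Negative similarities are clipped to zero (an assumption made here so
    that the result is a valid probability distribution).
    """
    clipped = [max(q, 0.0) for q in similarities]
    z = sum(clipped)
    if z == 0.0:
        # No candidate has positive similarity: fall back to a uniform distribution.
        return [1.0 / len(clipped)] * len(clipped)
    return [q / z for q in clipped]
```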

After the first iteration, we update the features and the feature similarity measure 
to $q'_c(\phi(x_i), \phi((A + f)(y_a)))$ and use them to approximate $p(V|X,Y,(A,f))$ as 
in eqn. (11). 



4.4 The M Step: Estimating A and f 

Once we have an approximation to p(V), we then need to estimate (A, f) ac- 
cording to eqn. (6). We expand E(p, (A, f)) as a Taylor series in A, f keeping the 
second order terms and then estimate (A, f) by least squares. 
5 Summary of the Algorithm 

Our algorithm is an approximation to the EM algorithm and 
proceeds as follows: 

1. Given a target shape X and a source shape Y, compute their informative 
features as described in section 4.2 and use $p_{IF}(\theta|X,Y)$ (equation (8)) to 
obtain several possible rotation angles $\theta_{initial}$. 

2. For each rotation angle $\theta_{initial}$, obtain a new shape Y' by rotating Y by 
$\theta_{initial}$. 

3. Update the features for shape Y' and estimate $p(V|X,Y,\theta)$ by $p_{IF}(V|X,Y,\theta)$ as 
in eqn. (11). 

4. Estimate (A, f) from the EM equation by the least-squares method. 

5. Obtain the new shape Y' by the transformation function AY + f(Y). Repeat 
from step 3 for 4 iterations. 

6. Compute the similarity measure and keep the best $(A, f)^*$ among all the 
$\theta_{initial}$, and compute the metric according to eqns. (3) and (2). (We 
can also combine the results from several starting points to approximate eqn. 
(2). In practice, we found there is not much difference except for special cases 
like the equilateral triangle.) 
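The control flow of the six steps can be sketched as a small driver loop. Everything here is hypothetical scaffolding: `match_shapes` and the callables `rotate`, `e_step`, `m_step`, `transform`, and `score` are placeholder names for the operations described above, and applying each update to the current shape (rather than to the originally rotated one) is an assumption.

```python
def match_shapes(x, y, candidate_angles, rotate, e_step, m_step,
                 transform, score, n_iter=4):
    """Try every initial rotation, run a few E/M iterations, keep the best result.

    Returns (best_score, theta_initial, last_params, transformed_shape).
    All callables are supplied by the caller; none is defined by the paper.
    """
    best = None
    for theta in candidate_angles:            # step 1: candidate rotations
        y_cur = rotate(y, theta)              # step 2: rotate the source shape
        params = None
        for _ in range(n_iter):               # steps 3-5, repeated n_iter times
            p_v = e_step(x, y_cur)            # step 3: approximate p(V)
            params = m_step(x, y_cur, p_v)    # step 4: least-squares update
            y_cur = transform(y_cur, params)  # step 5: apply the transformation
        s = score(x, y_cur)                   # step 6: similarity of the result
        if best is None or s > best[0]:
            best = (s, theta, params, y_cur)
    return best
```

Passing the steps in as functions keeps the sketch independent of how features, correspondences, and the transformation are actually represented.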

The algorithm takes 0.2 seconds to match shapes X and Y of around 100 
points. Note that our method does not need the target shape X and the source 
shape Y to have the same or nearly the same number of points, which is a key 
requirement for many matching algorithms. 

6 Experiments 

We tested our algorithm on a variety of data sets and some results are reported 
in this section. Fig. 4 shows the running example where the source shape Y in 
(d) is matched with the target shape X. Fig.4.e and .f show the transformation 
A* and f* estimated. 

6.1 MPEG7 Shape Database 

We first tested our algorithm on the MPEG7 CE-Shape-1 [7] which consists of 
70 types of objects each of which has 20 different silhouette images (i.e. a total of 
1400 silhouettes). Since the input images are binarized, we can extract contours 
and use the connected point representation. Fig. 5. a displays 2 images for each 
type. The task is to do retrieval and the recognition rate is measured by “Bull’s 
eye” [7]. For every image in the database, we match it with every other image and 
keep the 40 best matched candidates. For each one of the other 19 of the same 
type, if it is in the selected 40 best matches, it is considered as a success. Observe 
that the silhouettes also include mirror transformations which our algorithm 
can take into account because the informative features are computed based on 
relative angles. The recognition rates for different algorithms are shown in Table 
1 (results for the other algorithms are from [10]), which shows that our algorithm outperforms 
the alternatives. The speed is in the same range as those of shape contexts [3] and curve edit distance [10]. 
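The retrieval protocol described above can be written down compactly. This is a sketch under assumptions: `dist` is a precomputed matrix of matching costs (smaller means a better match), `labels` gives the object type of each image, and the function name `bulls_eye` is illustrative.

```python
def bulls_eye(dist, labels, top_k=40):
    """Fraction of same-class images found among the top_k best matches.

    dist[i][j]: matching cost between images i and j (smaller is better).
    Each query is matched against every other image, as in the text.
    """
    n = len(labels)
    hits = 0
    possible = 0
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        best = others[:top_k]
        hits += sum(1 for j in best if labels[j] == labels[i])
        possible += sum(1 for j in range(n) if j != i and labels[j] == labels[i])
    return hits / possible
```

For MPEG7 CE-Shape-1 this would be called with 1400 images and `top_k=40`, so that `possible` counts the 19 other images of the same type for each query.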
(a) Target shape X   (b) (A + f)(Y) at step 1   (c) (A + f)(Y) at step 4 

Fig. 4. Shape matching of the running example. 

(a) Some typical images in MPEG7 CE-Shape-1   (b) The four types with the lowest rates 

Fig. 5. Matching as image retrieval for the MPEG7 CE-Shape-1. 



Table 1. The retrieval rates of different algorithms for the MPEG7 CE-Shape-1. 
Results by the other algorithms are from Sebastian et al. [10]. 

Algorithm             Recognition Rate 
CSS                   75.44% 
Visual Parts          76.45% 
Shape Contexts        76.51% [3] 
Curve Edit Distance   78.17% [10] 
Our Method            80.03% 



6.2 The Kimia Data Set 

We then tested the identical algorithm (i.e. connected point representation and 
same algorithm parameters) on the Kimia data set of 99 shapes [11], which are 
shown in Fig. 6. a. For each shape, the 10 best matches are picked since there 
are 10 other images in the same category. Table 2 shows the numbers of correct 
matches. Our method performs similarly to Shock Edit [11] for the top 7 matches, 
but is worse for the top 8 to 10. Shape context performs less well than both 
algorithms on this task. Fig.6.b. displays the fifteen top matches for some shapes. 
Our relative failure, compared with Shock Edit, is due to the transformations 
which occur in the dataset (see the 8th to 10th examples for each object in Fig. 6), 
which require more sophisticated representations and transformations than 
those used in this paper. 
(a) The 99 silhouette images of the Kimia data set. 

(b) Some matching results by our method   (c) Some matching results by Shock Edit 

Fig. 6. The Kimia data set of 99 shapes and some matching results. 



Table 2. Numbers of matched shapes by different algorithms. Results by the other 
algorithms are due to Sebastian et al. [11]. 

Algorithm        Top 1  Top 2  Top 3  Top 4  Top 5  Top 6  Top 7  Top 8  Top 9  Top 10 
Shock Edit          99     99     99     98     98     97     96     95     93      82 
Our Method          99     97     99     98     96     96     94     83     75      48 
Shape Contexts      97     91     88     85     84     77     75     66     56      37 




(a) Some typical text images. 

Fig. 7. Results on some text images. (e) and (i) display the matching. We purposely 
put two shapes together and find that the algorithm is robust in this case. 



6.3 Text Image Matching 

The algorithm was also tested on real images of text in which binarization was 
performed followed by boundary extraction. Some examples are shown in Fig. 7. 
Similar results can be obtained by matching the model to edges in the image. 
Further tests on this dataset are ongoing. 
Fig. 8. Some results on the Chui and Rangarajan data set. 




Fig. 9. Result on a hand image. 



6.4 Chui and Rangarajan 

To test our algorithm as a shape registration method, we also tried the data set 
used by Chui and Rangarajan [5]. We used the sparse point representation in 
this case. The algorithm runs for 10 steps and some results are shown in Fig. 8. 
The quality of our results is similar to that reported in [5], but our algorithm 
runs an estimated 20 times faster. 
6.5 A Detection Task 

Our algorithm can also be used for object detection where, unlike recognition, 
we do not know where the object is in the image. To illustrate this, we tested 
our algorithm on a hand image used in [14]. Edge points were extracted to act as 
the target shape and the source image was a hand represented by sparse points. 
The result is shown in Fig. 9. 

7 Discussion 

This paper introduced a criterion for shape similarity and an algorithm for com- 
puting it. Our approach helps show relations between softassign [5] and shape 
contexts [3]. We formulated shape similarity by a generative model and used a 
modified variant of the EM algorithm for inference. A key element is the use of 
informative features to guide the algorithm to rapid and correct solutions. We 
illustrated our approach on datasets of binary and real images, and gave 
comparisons to other methods. Our algorithm runs at speeds which are comparable 
to some alternatives and is faster than others by orders of magnitude. 

Our work is currently limited by the types of representations we used and 
the transformations we allow. For example, it would give poor results for shapes 
composed of parts that can deform independently (e.g. human figures). For such 
objects, we would need representations based on symmetry axes such as 
skeletons [10] and parts [16]. Our current research is to extend our method to deal 
with such objects and to enable the algorithm to use input features other than 
edge maps and binary segmentations. 
Abstract. We propose the Generalized Histogram as a low-dimensional 
representation of an image for efficient and precise image matching. Multiplicity detection 
of videos in broadcast video archives is becoming important for many video-based 
applications, including commercial film identification, unsupervised video 
parsing and structuring, and robust highlight shot detection. This inherently requires 
efficient and precise image matching among an extremely large number of images. 
Histogram-based image similarity search and matching is known to be effective, 
and its enhancement techniques such as adaptive binning, subregion histogram, 
and adaptive weighting have been studied. We show that these techniques can be 
represented as linear conversions of high-dimensional primitive histograms and 
can be integrated into generalized histograms. A linear learning method to obtain 
generalized histograms from sample sets is presented, with a sample expansion 
technique to circumvent the overfitting problem due to high dimensionality and 
insufficient sample size. The generalized histogram takes advantage of these 
techniques, and achieves more than 90% precision and recall with a 16-D generalized 
histogram compared to the ground truth computed by normalized cross 
correlation. The practical importance of the work is shown by successful matching 
performance with 20,000 frame images obtained from actual broadcast videos. 



1 Introduction 

Recent advances in broadband networks and digital television broadcasting enable 
huge-scale image and video archives in the WWW space and broadcast video streams. In 
these huge-scale image and video archives, multiplicity detection is becoming important. 
For example, Cheung et al. [1] proposed a multiplicity detection method for video 
segments in the WWW to detect possible unlawful copying or tampering of videos. On the 
other hand, some researchers have noted that multiplicity detection is especially useful for 
broadcast video archives [2,3]. In this paper, by multiplicity detection for video archives, 
we mean that pairs of distinctively similar (or almost identical) video segments can be 
extracted from videos; i.e., they originate from the same video materials, but with 
different transmission conditions or different post-production effects, e.g., video captions, 
scale, shift, clip, etc. Figure 1 shows typical examples of multiplicity in video archives. 
They include (a) commercial films, (b) opening/ending of programs, (c) opening/ending 
of some types of scenes, (d) video shots of an important event, etc. Detection of (a) 
is particularly important for a company that needs to check if its commercial films are 
properly broadcast. Video segments (b) and (c) themselves are not very important; 
however, since they represent “punctuation” of video streams, they are useful for video 
parsing and structuring, i.e., identification of particular programs or locating breaks 
between news stories. Multiplicity detection may realize unsupervised video parsing 
(especially for news videos) without any a priori knowledge of the visual features of the 
punctuation shots [4]. Highlight shots of important events such as Olympic games or 
world cup soccer games, including (d), tend to be repeated; for instance, highlight shots 
in on-the-spot broadcasting of sports may be repeatedly used in several news or sports news 
programs as “play of the day.” Thus multiplicity detection may help to detect highlight shots. 

(a) commercial film   (b) news opening   (c) weather forecast   (d) sports highlight 

Fig. 1. Examples of identical video segments 

Despite its importance, since identical video segment pairs are essentially very rare, 
multiplicity detection could become meaningless unless it is applied to huge-scale video 
archives. Therefore precise matching between images is required; otherwise, the desired 
identical video segment pairs may be buried under a large number of mistakenly detected 
pairs. At the same time, the method should tolerate slight modifications to images, 
so it is required to ignore small video captions, slight clipping, etc. In order to 
satisfy these contradictory requirements, pixelwise comparison is needed; it is, however, 
very time consuming. In order to apply it to huge-scale video archives, a drastic speedup 
is needed. A compact (low-dimensional) representation of images is necessary 
for efficient image matching, while preserving precise matching performance. 

In this paper, we propose the generalized histogram as a low-dimensional representation 
of an image which realizes precise yet efficient image matching. 
The generalized histogram is a histogram-based low-dimensional representation; thus 
it achieves a very efficient matching process. Several techniques have been studied for 
enhanced histogram representation, including adaptive binning, subregion histogram, 
and adaptive weighting. We show that the generalized histogram takes advantage of 
these techniques at the same time. Successful results are shown by detecting multiplicity 
in actual video footage taken from television broadcasts to demonstrate the effectiveness of the 
generalized histograms. 

2 Multiplicity Detection in Large Collection of Images 

As described, multiplicity detection in video archives should be applied to huge-scale 
video archives. Since video archives obviously contain huge amount of frame images, the 
method should discover identical image pairs among huge number of images. Strong 
correlation between consecutive frames in a video allows most frames to be skipped 
in a certain period. However, in order not to miss many identical video segments with 
dynamic motion, the period should be short enough (ten frames or so). Consequently, dis- 
covery of identical image pairs between large image sets is an inevitable key technology. 
Assume that two sets of images $IS^1 = \{I^1_i\}$ and $IS^2 = \{I^2_j\}$ are given. We use 
normalized cross correlation (NCC) for the similarity check here, using a threshold 
$-1 \ll \theta_h < 1$. The resultant identical image pair set $IIPS$ is thus: 

$$IIPS = \{(I^1, I^2) \mid I^1 \in IS^1, I^2 \in IS^2, NCC(I^1, I^2) > \theta_h\}$$

$$NCC(I^1, I^2) = \frac{\sum_{x,y} (I^1 - \bar{I}^1)(I^2 - \bar{I}^2)}{\{\sum_{x,y} (I^1 - \bar{I}^1)^2 \sum_{x,y} (I^2 - \bar{I}^2)^2\}^{1/2}}$$

where, for notational convenience, $I^1$ and $I^2$ represent the intensity value of each image at the 
location (x,y), and $\bar{I}$ is its mean. For simplicity, we assume that all images in the collections 
are monochrome, but extension to color images can easily be achieved. It is obviously 
intractable to calculate NCC for every combination of images. Our implementation 
takes about 16 ms on an ordinary PC to compute NCC between two 352 x 240 images. 
For example, extracting $IIPS$ from two sets of 100,000 images would take 16 ms x 
100,000 x 100,000 ≈ five years. Since even a one-hour video (30 fps) contains about 
100,000 frames, intensive speedup is required. 
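As a rough illustration of the NCC definition and of the cost estimate, the following sketch computes NCC for two small images given as lists of rows and reproduces the "five years" figure. The function name `ncc`, the pure-Python image representation, and the zero return value for constant images are assumptions for illustration only.

```python
import math

def ncc(img1, img2):
    """Normalized cross correlation of two equally sized grayscale images."""
    a = [v for row in img1 for v in row]
    b = [v for row in img2 for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    den = math.sqrt(sum((p - mean_a) ** 2 for p in a) *
                    sum((q - mean_b) ** 2 for q in b))
    return num / den if den > 0 else 0.0  # constant image: correlation undefined

# Cost of exhaustive matching: 16 ms per pair, 100,000 x 100,000 pairs.
total_seconds = 16e-3 * 100_000 * 100_000
total_years = total_seconds / (365 * 24 * 3600)  # roughly five years
```

Because the mean is subtracted and the result is normalized, NCC is unaffected by a global gain or offset in intensity, which is why it suits segments that differ only in transmission conditions.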

In order to efficiently search a database that requires costly similarity (or 
distance) calculation, a “dirty filtering” technique is sometimes used [5]. In this technique, each 
element in the database is first converted into a low-dimensional (ten to twenty 
dimensions) vector by some function $f$. The benefit of dimensionality reduction is twofold. Firstly, 
metric calculation becomes much lighter, because the calculation is basically 
proportional to the number of dimensions (352 x 240 versus 10 to 20). Secondly, data of fewer 
than 100 dimensions are well suited to high-dimensional indexing techniques 
using tree-type structures and/or hashing [6], which further accelerate the search. Then 
an approximation of $IIPS$ is obtained as follows: 

$$\widetilde{IIPS} = \{(I^1, I^2) \mid I^1 \in IS^1, I^2 \in IS^2, d(f(I^1), f(I^2)) < \theta_d\}$$

where $d(\cdot)$ represents the distance between vectors, $\widetilde{IIPS}$ is the approximation of 
$IIPS$, and $\theta_d$ is a threshold value for the distance. Finally, each element of $\widetilde{IIPS}$ 
is “cleansed” by the original metric (in our case, NCC). “The lower-bound condition,” 
i.e., $IIPS \subset \widetilde{IIPS}$, ensures that dirty filtering will not cause false dismissals. However, 
due to the discrepancy between the original metric and the metric in the converted 
low-dimensional space, this condition tends to be difficult to satisfy. If we somehow achieve 
$IIPS \subset \widetilde{IIPS}$, $\widetilde{IIPS}$ may become unexpectedly large, and thus cleansing may take 
impractically long. Instead we allow $IIPS \not\subset \widetilde{IIPS}$, but we keep $IIPS - \widetilde{IIPS}$ (false 
dismissals) as small as possible. Since the size of $\widetilde{IIPS}$ reflects the cleansing 
cost, i.e., the number of NCC calculations, it should also preferably be small. 

To evaluate these, precision and recall are suitable: precision = $|IIPS \cap \widetilde{IIPS}| / |\widetilde{IIPS}|$, 
recall = $|IIPS \cap \widetilde{IIPS}| / |IIPS|$. In order to make $\widetilde{IIPS}$ small, 
precision should be large and close to one. On the other hand, to make $IIPS - \widetilde{IIPS}$ small, 
since 1 - recall = $|IIPS - \widetilde{IIPS}| / |IIPS|$, recall should also be large and close to 
one. In the evaluation, we set another threshold $\theta_l$, $\theta_l < \theta_h$, to employ a rejection range; image 
pairs having NCC values between $\theta_l$ and $\theta_h$ are excluded and will not be used for the 
evaluation. 
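The two evaluation measures can be computed directly from the sets of image pairs. A minimal sketch, assuming each pair is represented as a hashable tuple of image identifiers; the function name and the convention of returning 1.0 for an empty denominator are assumptions.

```python
def precision_recall(true_pairs, candidate_pairs):
    """Precision and recall of the filtered pair set against the true pair set.

    true_pairs: pairs accepted by the original metric (NCC above the threshold).
    candidate_pairs: pairs that pass the low-dimensional distance filter.
    """
    true_set = set(true_pairs)
    cand_set = set(candidate_pairs)
    hit = len(true_set & cand_set)
    precision = hit / len(cand_set) if cand_set else 1.0
    recall = hit / len(true_set) if true_set else 1.0
    return precision, recall
```

Here `1 - recall` is the fraction of false dismissals, while `1 / precision` indicates how many NCC evaluations the cleansing step spends per true pair it confirms.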
3 Generalized Histogram 

3.1 Histogram Composition 

Histograms are sometimes used as a low-dimensional representation of images, especially 
for image similarity matching [9,7] and multiplicity detection [1,8]. Assume that an 
image $I$ is composed of $M$ pixels, and its intensity value at the location $(x,y)$ is $v = v(x,y)$. 
Then the image can be regarded as a scatter of $M$ pixels in the three-dimensional space 
$(x,y,v)$. To compute a histogram of the image, we first need to tessellate the $(x,y,v)$ 
space into $N$ non-overlapping regions $R_i$, $i = 1, ..., N$. Then an $N$-bin histogram 
($H = [h_1 h_2 ... h_N]^T$ in vector notation) can be computed by counting the pixels which 
fall into each region: $h_i = |\{(x,y,v(x,y)) \mid (x,y,v(x,y)) \in R_i\}|$. For example, if we 
divide the range of intensity values $[v_l, v_h)$ into $N$ regions at the points $v_i = v_l + \frac{v_h - v_l}{N} i$ 
where $i = 0, ..., N$, then by defining the tessellation as $R_i = \{(x,y,v) \mid v_{i-1} \le v < v_i\}$ we 
will obtain $N$-bin global intensity histograms. Subregion histograms can be obtained 
by the tessellation $R_{i,j,k} = \{(x,y,v) \mid x_{i-1} \le x < x_i, y_{j-1} \le y < y_j, v_{k-1} \le v < v_k\}$ 
where $x_i$, $y_j$, and $v_k$ are dividing points of the ranges of $x$, $y$, and $v$ respectively. The resultant 
histogram should then be reordered into a linear list by a particular ordering such 
as the lexicographic order. 
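A regular subregion histogram of this kind can be sketched as follows. The function name `subregion_histogram`, the list-of-rows image format, and the default intensity range [0, 256) are assumptions made for illustration; bins are flattened in lexicographic (i, j, k) order.

```python
def subregion_histogram(image, nx, ny, nv, v_low=0, v_high=256):
    """Count pixels over a regular nx x ny x nv tessellation of (x, y, v).

    image: list of rows of intensity values in [v_low, v_high).
    Returns a flat list of nx * ny * nv counts in lexicographic (i, j, k) order.
    """
    height = len(image)
    width = len(image[0])
    hist = [0] * (nx * ny * nv)
    for y, row in enumerate(image):
        j = y * ny // height                   # vertical block index
        for x, v in enumerate(row):
            i = x * nx // width                # horizontal block index
            k = int((v - v_low) * nv / (v_high - v_low))
            k = min(max(k, 0), nv - 1)         # clamp out-of-range intensities
            hist[(i * ny + j) * nv + k] += 1
    return hist
```

Setting `nx = ny = 1` gives the global intensity histogram, and the counts always sum to the number of pixels M.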

Obviously, the choice of tessellation of the $(x,y,v)$ space affects the performance of 
image similarity evaluation and matching. For subregion division, researchers employ 
regular divisions such as 1 x 1 (global histograms), 2 x 2, 3 x 3, etc., as well as irregular 
divisions such as one subregion at the center and four subregions in the peripheral region 
[10]. For tessellation of the range of intensity, or tessellation of the color space in most 
cases, adaptive binning techniques have been studied [11,12]. In adaptive binning, the 
tessellation is determined according to the actual distribution of pixels, and thus the 
tessellation becomes fine in dense regions and coarse in sparse regions. If the 
tessellation is determined independently for each image, the resultant histograms better reflect 
the actual distribution of pixels. We call this dynamic adaptive binning. However, since 
bins of histograms of different images do not necessarily correspond, special metrics 
such as the Earth Mover’s Distance [11] or weighted correlation [12] must be used, 
which are computationally more costly than the simple Euclidean distance. Thus dynamic 
adaptive binning does not fit the purpose of dirty filtering. Instead, we can determine the 
tessellation based on the distribution of pixels for all images in the image collections. This 
makes the tessellation the same for all the images, and therefore ordinary metrics including the 
Euclidean distance can be used. We call this static adaptive binning. We will consider 
static adaptive binning instead of dynamic adaptive binning in this paper. 

After histograms are obtained from images, we then need to evaluate distances 
between histograms. In evaluating the distance between two histograms $H^1 = [h^1_1 ... h^1_N]$ and 
$H^2 = [h^2_1 ... h^2_N]$, the Minkowski distance is sometimes used: 
$d(H^1, H^2) = (\sum_i \|h^1_i - h^2_i\|^p)^{1/p}$. In particular, the Manhattan distance (p = 1) and the Euclidean distance (p = 2) 
are frequently used. As a variant of this, the weighted Minkowski distance is also 
used: $d_w(H^1, H^2) = (\sum_i w_i \|h^1_i - h^2_i\|^p)^{1/p}$ where $w_i$ are weight coefficients. An 
adaptive weighting technique optimizes the coefficients adaptively, mainly according to users’ 
preference by relevance feedback in image retrieval systems [13]. Other metrics such as 
the quadratic metric [14] and the Mahalanobis distance are also used. We mainly concentrate 
on the Euclidean distance in this paper. 
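For reference, the (weighted) Minkowski distance between two histograms can be sketched as below. The function name `minkowski` and the convention that the weights multiply the p-th powers of the bin differences are assumptions based on the description above.

```python
def minkowski(h1, h2, p=2, weights=None):
    """Weighted Minkowski distance (sum_i w_i * |h1_i - h2_i| ** p) ** (1 / p)."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    if weights is None:
        weights = [1.0] * len(h1)  # unweighted case
    total = sum(w * abs(a - b) ** p for w, a, b in zip(weights, h1, h2))
    return total ** (1.0 / p)
```

With `p=1` this is the Manhattan distance and with `p=2` the Euclidean distance; a zero weight removes a bin from the comparison entirely.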




Fig. 2. Tessellation Approximation by Primitive Histogram 



3.2 Histogram Conversion 

If we have a fine-grain tessellation, any tessellation can be approximated by aggregating 
the fine-grain tessellation. Based on this idea, we will show that histograms using 
adaptive binning and various subregions can be realized by linear conversion of fine-grain 
histograms. Assume that regions $R_i$, $i = 1, ..., N$ compose a fine-grain tessellation, i.e., 
the size of each $R_i$ is small enough, and a histogram $H = [h_1 ... h_N]^T$ is calculated by 
the tessellation. The tessellation could be regular, e.g., 8 x 8 x 16 for the ranges of $x$, $y$, and 
$v$ respectively, which generates 1024-bin histograms. We call the fine-grain histograms 
primitive histograms. Then any tessellation $R'_i$, $i = 1, ..., N'$ can be approximated by 
$R_i$ as follows: 

$$\tilde{R}'_i = \bigcup_j \{ R_j \mid \forall k, |R'_i \cap R_j| \ge |R'_k \cap R_j| \}$$

where $\tilde{R}'_i$ is an approximation of $R'_i$ and $|\cdot|$ represents the size of the region. Intuitively, 
$R'_i$ is approximated by the union of the $R_j$ which belong to $R'_i$ (see Fig. 2). Thus a histogram 
$H' = [h'_1 ... h'_{N'}]^T$ based on the tessellation $\tilde{R}'_i$ is calculated using an $N'$ by $N$ matrix $A$ 
as follows: 

$$H' = AH$$

$$(A)_{ij} = \begin{cases} 1 & \text{if } \tilde{R}'_i \cap R_j \ne \emptyset \\ 0 & \text{otherwise.} \end{cases}$$

We call the matrix $A$ the aggregation matrix. By this conversion, any variation of binning 
and subregion can be realized as a simple linear conversion at the cost of approximation 
error. The error can be made arbitrarily small by using a finer tessellation for primitive 
histogram calculation. 

Weighted distance between histograms can also be realized by linear conversion of 
histograms. The weighted Euclidean distance between histograms $H^1 = [h^1_1 ... h^1_N]^T$ 
and $H^2 = [h^2_1 ... h^2_N]^T$ can be calculated as: 

$$d_w^2(H^1, H^2) = \sum_i w_i^2 \|h^1_i - h^2_i\|^2$$

where $w_i$ are weight coefficients. This can be achieved by simply calculating the 
Euclidean distance of weighted histograms: $H' = [w_1 h_1 ... w_N h_N]^T$. Thus weighted 
distance can also be realized by linear histogram conversion: 

$$H' = WH$$

$$W = \mathrm{diag}(w_1, w_2, ..., w_N).$$

Obviously these two conversions have the same linear form. Thus they can be 
combined easily, e.g., $H' = AWH$. We now allow the conversion matrix to be a general 
matrix $G$, i.e., $H' = GH$, which converts an $N$-bin histogram $H$ to an $N'$-bin histogram $H'$. 
The converted histogram can represent not only various binning and subregion histograms 
plus weighted histograms, but also much more general adaptable histograms. We call this 
histogram the generalized histogram, and $G$ the generalized histogram generator matrix, or 
generator matrix for short. By properly designing the generator matrix, generalized 
histograms can take advantage of adaptive binning, adaptive subregion, adaptive weighting, 
and possibly much more flexible adaptability. 
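The two linear conversions and their composition can be illustrated with plain Python lists. This is a sketch only: the helper names (`matvec`, `aggregation_matrix`, `weight_matrix`, `convert`, `euclidean`) and the description of a tessellation as lists of merged bin indices are assumptions introduced for the example.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def aggregation_matrix(groups, n_bins):
    """Build A: row i has a 1 for every primitive bin merged into output bin i."""
    a = [[0.0] * n_bins for _ in groups]
    for i, members in enumerate(groups):
        for j in members:
            a[i][j] = 1.0
    return a

def weight_matrix(weights):
    """Build the diagonal matrix W = diag(w_1, ..., w_N)."""
    n = len(weights)
    return [[weights[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def convert(factors, hist):
    """Apply matrices right to left, e.g. convert([A, W], H) computes A (W H)."""
    out = list(hist)
    for m in reversed(factors):
        out = matvec(m, out)
    return out

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

For instance, merging bins {0, 1} and {2, 3} of the primitive histogram [1, 2, 3, 4] gives [3, 7]. A general generator matrix G simply replaces the product of A and W with an arbitrary N' by N matrix.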

3.3 Linear Learning of Generalized Histogram Generator Matrix 

Learning the generator matrix from training samples would be an ideal method because it 
could adapt to the actual distribution of images. As training sets, we assume that positive 
samples $S^+$ and negative samples $S^-$ are given: 

$$S^+ = \{(I^1, I^2) \mid NCC(I^1, I^2) > \theta_h\}$$
$$S^- = \{(I^1, I^2) \mid NCC(I^1, I^2) < \theta_l\}.$$

Let $\mathcal{H}(I)$ be the primitive histogram of the image $I$. The Euclidean distance between 
generalized histograms of images $I^1$ and $I^2$ is: 

$$d^2(I^1, I^2) = \|G\mathcal{H}(I^1) - G\mathcal{H}(I^2)\|^2 = \|G[\mathcal{H}(I^1) - \mathcal{H}(I^2)]\|^2.$$

An ideal generator matrix $G$ should make distances between image pairs of 
positive samples smaller, while at the same time making distances between image pairs of negative 
samples larger. We can achieve this by a technique similar to the well-known Multiple 
Discriminant Analysis [15] (MDA for short). Let the covariance matrices of histogram 
differences of image pairs of positive and negative samples, $C^+$ and $C^-$ respectively, be: 

$$C^{+/-} = \sum_{(I^1, I^2) \in S^{+/-}} (\mathcal{H}(I^1) - \mathcal{H}(I^2))(\mathcal{H}(I^1) - \mathcal{H}(I^2))^T.$$

Since $GCG^T$ represents a scatter matrix of the vectors $G[\mathcal{H}(I^1) - \mathcal{H}(I^2)]$, an ideal generator 
matrix $G$ minimizes the following criterion: 

$$J(G) = \frac{\det(GC^+G^T)}{\det(GC^-G^T)}$$

because $\det(GCG^T)$ is proportional to the square of the hyperellipsoidal scattering 
volume [15]. $J(G)$ can be minimized by solving the following generalized eigenvalue 
problem: $C^-\Phi = C^+\Phi\Lambda$, where $\Phi = [\phi_1 \phi_2 ... \phi_N]$ is the eigenvector matrix and $\Lambda$ is the 
diagonal matrix of eigenvalues. Thus the $N'$ by $N$ optimal generalized histogram generator 
matrix $G$ is obtained from the $N'$ eigenvectors corresponding to the $N'$ largest eigenvalues, i.e., 
$G = [\phi_1 \phi_2 ... \phi_{N'}]^T$. 
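To see why the eigenvectors with the largest eigenvalues are the ones selected, it may help to work through the single-row case N' = 1. This is a sketch only; the symbol g for the single row of the generator matrix is introduced here for illustration.

```latex
% Single-row case: G = g^T, so both determinants reduce to scalars.
J(g) = \frac{g^T C^+ g}{g^T C^- g}

% Setting the gradient with respect to g to zero:
\nabla_g J = \frac{2\,C^+ g\,(g^T C^- g) - 2\,C^- g\,(g^T C^+ g)}{(g^T C^- g)^2} = 0
\;\Longrightarrow\; C^+ g = J(g)\, C^- g
\;\Longleftrightarrow\; C^- g = \lambda\, C^+ g, \qquad \lambda = \frac{1}{J(g)}.
```

Every stationary point is therefore a generalized eigenvector, and since J = 1/λ, the smallest value of the criterion is attained at the eigenvector with the largest eigenvalue.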
3.4 Learnability of Histogram Adaptation 

It is still unclear whether the presented algorithm produces similar effect to histogram 
adaptation techniques. We will discuss on this here. 

Assume that the algorithm determines the generator matrix $G = [\phi_1 ... \phi_{N'}]^T$ which 
converts a primitive histogram $H = [h_1 ... h_N]^T$ into a generalized histogram 
$H' = [h'_1 ... h'_{N'}]^T$, i.e., $H' = GH$. The algorithm determines the plane spanned by $\phi_1, ..., \phi_{N'}$, and 
the generalized histogram is the projection of the primitive histogram onto this plane. If there 
is large variance in the $h_i$ direction in positive samples $S^+$, i.e., the variance of $h^1_i - h^2_i$ 
is large where $(I^1, I^2) \in S^+$ and $\mathcal{H}(I^1) = [... h^1_i ...]$, $\mathcal{H}(I^2) = [... h^2_i ...]$, and especially if 
this is larger than that in the $h_j$ direction, then ideal adaptive weighting should provide a 
larger weight to the $h_j$ direction than to $h_i$. This is because $h_j$ is stationary in the positive 
samples and thus useful for discrimination, but $h_i$ is unstable and should be suppressed. 
MDA has a similar effect since it determines the projection direction so as to suppress 
large variance in the original scatter of positive samples. Thus the proposed algorithm 
has a similar effect to adaptive weighting. 

In ideal adaptive binning, two bins (in most cases neighboring bins) are preferably merged 
if they are strongly correlated in training samples. When we think of subregion 
histograms, two blocks should be merged into one block if these two have correlated 
intensity between image pairs in samples. Two blocks have correlated intensity when 
corresponding bins between the blocks are correlated. Thus adaptive binning, as well as 
adaptive subregion, may be realized by merging correlated elements. Assume that the 
directions $h_p$ and $h_q$ are correlated. Then MDA will produce a projection direction which is 
close to parallel to the plane spanned by $h_p$ and $h_q$, so that one direction in the generalized 
histogram is made close to $w_p h_p + w_q h_q$, while the other independent direction of the 
form $w'_p h_p + w'_q h_q$ is suppressed. Weights $w_p$ and $w_q$ will be determined according to 
the distribution in these directions, similarly to linear regression analysis. This obviously 
includes the case $w_p = w_q$, which has the same effect as the aggregation matrix. The case 
when more than three directions are correlated can also be treated similarly. 

3.5 Sample Expansion 

To obtain effective generalized histograms, a finer tessellation is needed for the primitive histograms, so high-dimensional vectors must be handled as primitive histograms. This makes estimation of the generator matrix harder, because it requires optimizing a low-dimensional projection in a very high-dimensional space. In general, an insufficient number of training samples in a high-dimensional space degrades performance through overfitting, so learning in such a space requires more training samples (the preferable number of samples is exponential in the number of dimensions [16]). However, it can be difficult to prepare a sufficient number of samples, especially positive samples, because identical image pairs are rare.

To circumvent this problem, we expand the number of samples by virtually adding perturbations to the intensity value of each pixel. We call this technique sample expansion. Assume that we have a pair of images $I^1$ and $I^2$ from a sample set (either positive or negative), with corresponding primitive histograms $H^1 = [h^1_1 \; h^1_2 \ldots h^1_N]^T$ and $H^2 = [h^2_1 \; h^2_2 \ldots h^2_N]^T$. For simplicity, we consider intensity tessellation only. The covariance matrix of primitive histogram differences of image pairs can be obtained by summing the following elementary covariance matrix over all image pairs:

$$E = (H^1 - H^2)(H^1 - H^2)^T.$$

The basic idea of sample expansion is to derive the expectation of $E$ when perturbations are added probabilistically to the intensity values of pixels, and thereby to derive the expectation of the covariance matrix of primitive histogram differences. We assume that only one pixel of either $I^1$ or $I^2$ is perturbed, with probability $P$. If we virtually perturb a pixel so that it moves to a bin neighboring its original bin, and the pixel originally belongs to $h_i$, then the value of $h_i$ decreases and the value of the neighboring bin ($h_{i-1}$ or $h_{i+1}$, depending on the sign of the perturbation) increases. Thus the primitive histogram becomes

$$H_{i,+} = [h_1 \ldots h_{i-1} \;\; h_i - 1 \;\; h_{i+1} + 1 \ldots h_N]^T$$

$$H_{i,-} = [h_1 \ldots h_{i-1} + 1 \;\; h_i - 1 \;\; h_{i+1} \ldots h_N]^T$$

depending on the sign of the perturbation. The probability that the case $H_{i,+}$ occurs is assumed to be equal to the probability of $H_{i,-}$, and is the probability that a pixel in $I$ falls in $h_i$ (we denote this by $p_i$) times $\frac{1}{2}P$. If we assume that the distribution of pixel intensity is identical for all images, the probability that a pixel in any image falls in the region $R_i$ is $p_i$. If we let $p(v)$ be the pdf of the intensity value, $p_i$ is derived as follows:



$$p_i = \iiint_{R_i} p(v) \, dx \, dy \, dv. \tag{1}$$



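In the case of $H_{i,+}$ or $H_{i,-}$ for $I^1$, i.e., $H^1_{i,\pm}$, the elementary covariance matrix is:

$$E|_{H^1_{i,\pm}} = (H^1_{i,\pm} - H^2)(H^1_{i,\pm} - H^2)^T$$

$$(E)_{i,i} = (\Delta h_i)^2 \mp 2\Delta h_i + 1$$

$$(E)_{i,i\pm1} = (E)_{i\pm1,i} = \Delta h_i \Delta h_{i\pm1} \pm \Delta h_i \mp \Delta h_{i\pm1} - 1$$

$$(E)_{i\pm1,i\pm1} = (\Delta h_{i\pm1})^2 \pm 2\Delta h_{i\pm1} + 1$$

$$(E)_{i,j} = (E)_{j,i} = \Delta h_i \Delta h_j \mp \Delta h_j$$

$$(E)_{i\pm1,j} = (E)_{j,i\pm1} = \Delta h_{i\pm1} \Delta h_j \pm \Delta h_j$$

$$(E)_{j,k} = \Delta h_j \Delta h_k$$

where $\Delta h_i = h^1_i - h^2_i$, and $j, k \neq i, i \pm 1$. We can then integrate the equations above to derive the expectation of the elementary covariance matrix:

$$\mathcal{E}\{E\} = \sum_i \sum_{s \in \{+,-\}} \frac{E|_{H^1_{i,s}} + E|_{H^2_{i,s}}}{2} \cdot \frac{1}{2} P p_i$$

$$(\Delta E)_{i,j} = \mathcal{E}\{(E)_{i,j}\} - \Delta h_i \Delta h_j$$

$$(\Delta E)_{1,1} = P\left(\tfrac{1}{2} p_1 + \tfrac{1}{2} p_2\right)$$

$$(\Delta E)_{i,i} = P\left(\tfrac{1}{2} p_{i-1} + p_i + \tfrac{1}{2} p_{i+1}\right) \quad (i \neq 1, N)$$

$$(\Delta E)_{N,N} = P\left(\tfrac{1}{2} p_{N-1} + \tfrac{1}{2} p_N\right)$$

$$(\Delta E)_{i-1,i} = (\Delta E)_{i,i-1} = -P\left(\tfrac{1}{2} p_{i-1} + \tfrac{1}{2} p_i\right)$$

$$(\Delta E)_{i,j} = 0 \quad (\text{otherwise})$$

(note that terms whose signs are affected by the sign of the perturbation cancel), where $\mathcal{E}\{\cdot\}$ denotes expectation. We can then derive the expectation of the covariance matrices as follows:

$$\mathcal{E}\{C^+\} = C^+ + |S^+| \Delta E$$

$$\mathcal{E}\{C^-\} = C^- + |S^-| \Delta E.$$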

This modification still keeps the covariance matrices positive semi-definite. The above derivation considers intensity tessellation only; however, the extension to joint tessellation of intensity and location is straightforward. The resulting expected covariance matrices are then used to calculate the generalized histogram generator matrix.
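
The correction matrix $\Delta E$ has a simple banded structure, which a short sketch can make concrete. The Python below is an illustration only, not the author's implementation: the function names `delta_e` and `expected_covariance`, the list-of-rows matrix layout, and the example values are assumptions introduced here, and only intensity tessellation with at least two bins is covered.

```python
def delta_e(p, P):
    """Return the N x N correction matrix Delta E as a list of rows.

    p: bin probabilities p_1..p_N (N >= 2); P: perturbation probability.
    """
    n = len(p)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        left = 0.5 * p[i - 1] if i > 0 else 0.0
        right = 0.5 * p[i + 1] if i < n - 1 else 0.0
        # The first and last bins have a single neighbour, so their own
        # term is p_i / 2 instead of p_i.
        own = p[i] if 0 < i < n - 1 else 0.5 * p[i]
        d[i][i] = P * (left + own + right)
    for i in range(1, n):
        # Symmetric off-diagonal entries between neighbouring bins.
        d[i - 1][i] = d[i][i - 1] = -P * 0.5 * (p[i - 1] + p[i])
    return d


def expected_covariance(C, n_pairs, dE):
    """E{C} = C + |S| * Delta E, where |S| is the number of sample pairs."""
    n = len(C)
    return [[C[i][j] + n_pairs * dE[i][j] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    p = [0.1, 0.2, 0.3, 0.4]      # hypothetical bin probabilities
    for row in delta_e(p, P=0.01):
        print(row)
```

Each row of $\Delta E$ sums to zero and its off-diagonal entries are non-positive, so it has the form of a weighted graph Laplacian on the chain of bins. This is one way to see why adding it keeps the covariance matrices positive semi-definite.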



4 Experiments 

We apply the proposed method to images obtained from broadcast video archives. We select four pieces of video footage, each one hour long, broadcast in the same time slot on the same channel but on different days. 10,000 randomly chosen images are then extracted from each to compose four image sets, namely A, B, C, and D. Since they are taken from the same time slot, they include the same video programs (the slot includes news and documentary), and thus each image set is expected to have a similar distribution in the image space, or at least to include some shots identical to each other, such as opening and ending shots. Two sets (A and B) are used as training samples, and the other two (C and D) are used for testing. From the training set, the positive sample set is composed of pairs of images from the two image sets having an NCC value larger than $\theta_h$. The negative sample set is composed of 10,000 randomly chosen pairs of images having an NCC smaller than $\theta_l$. From the test set, the positive sample set is generated similarly, but the negative sample set is composed of all pairs having an NCC smaller than $\theta_l$. We use $\theta_h = 0.9$ and $\theta_l = 0.8$ in our experiments. The resulting positive and negative test sets are used as the ground truth for precision-recall evaluation. The sizes of the positive sample sets are 2124 and 3278, while the sizes of the negative sample sets are 10,000 and approximately 10,000 × 10,000, for the training and test sets respectively.
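
The labeling rule for sample pairs can be sketched as follows. This is a minimal illustration under stated assumptions: images are given as flat sequences of intensities of equal length, the names `ncc` and `label_pair` are hypothetical, and returning 0.0 for a constant image is a choice made for this example rather than something the text specifies. Pairs whose NCC falls between the two thresholds are left unlabeled.

```python
import math


def ncc(a, b):
    """Normalized cross-correlation of two equal-length intensity sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # assumption of this sketch: a constant image has no correlation
    return sum(x * y for x, y in zip(da, db)) / denom


def label_pair(a, b, theta_h=0.9, theta_l=0.8):
    """Return 'positive', 'negative', or None for a pair of images."""
    score = ncc(a, b)
    if score > theta_h:
        return "positive"
    if score < theta_l:
        return "negative"
    return None  # between the thresholds: used in neither sample set


if __name__ == "__main__":
    print(label_pair([1, 2, 3, 4], [5, 7, 9, 11]))   # brightness/contrast change only
    print(label_pair([1, 2, 3, 4], [4, 3, 2, 1]))
```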

Primitive histograms are then calculated for the images in the sets. Since we use NCC as the image matching criterion, the intensity value $v$ of each pixel is first normalized by the mean $m$ and the standard deviation $\sigma$ into a normalized intensity, i.e., $v' = (v - m)/\sigma$. To compute primitive histograms, we use a regular 8 × 8 × 16 tessellation of the $(x, y, v')$ space. For the normalized intensity, we regularly divide the range $[-2, 2)$, since this range covers most of the distribution (95%) for a Gaussian distribution. We call these histograms normalized intensity histograms. In this case, since the normalized intensity values follow a normalized Gaussian, Eq. (1) becomes:

$$p_i = \iiint_{R_i} \frac{1}{\sqrt{2\pi}} e^{-\frac{v^2}{2}} \, dx \, dy \, dv. \tag{2}$$

We use this for $p_i$ in calculating generalized histograms.
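
Eq. (2) can be evaluated in closed form once the spatial part of the integral is set aside. As a minimal sketch, assuming that the $(x, y)$ integration contributes only a constant factor that is dropped here (an assumption of this example, not a statement from the text), the probability mass of each of the 16 intensity bins on $[-2, 2)$ follows from the standard normal CDF. The function name `gaussian_bin_probs` is hypothetical.

```python
import math


def gaussian_bin_probs(n_bins=16, lo=-2.0, hi=2.0):
    """Probability mass of a standard Gaussian in each of n_bins equal bins on [lo, hi)."""

    def cdf(v):
        # Standard normal cumulative distribution function.
        return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

    width = (hi - lo) / n_bins
    edges = [lo + k * width for k in range(n_bins + 1)]
    return [cdf(edges[k + 1]) - cdf(edges[k]) for k in range(n_bins)]


if __name__ == "__main__":
    probs = gaussian_bin_probs()
    print(len(probs), round(sum(probs), 4))
```

The 16 values sum to about 0.954, the share of a standard Gaussian that lies inside $[-2, 2)$, which agrees with the roughly 95% coverage quoted above.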




Fig. 3. Comparison of matching performance 

We thus employ 8 × 8 × 16 = 1024-bin normalized intensity histograms as primitive histograms. A generator matrix is computed from the training sets. Then 16-D generalized histograms are generated from the test sets, and the image matching performance is evaluated in terms of precision-recall. For comparison, we generated 16-D normalized intensity histograms with a regular 2 × 2 × 4 tessellation of the $(x, y, v')$ space, and evaluated their filtering performance using the Euclidean distance. We investigated several combinations of regular tessellation and evaluated 8- to 64-D normalized intensity histograms, and found that the tessellation used here performs best [17]. Figure 3 shows the resulting precision-recall graphs comparing the matching performance of the normalized intensity histograms, the generalized histograms without sample expansion, and the generalized histograms with sample expansion. The generalized histograms clearly outperform ordinary histograms, and the sample expansion technique boosts the performance drastically. The final performance exceeds 90% recall and 90% precision at the same time, and even at 95% recall more than 80% precision is achieved, which is very satisfactory. In the evaluation, we varied the perturbation probability $P$ from 0.0001 to 0.01 and found no effect on the matching performance, which confirms that the performance is insensitive to $P$. To inspect the dependence of the performance on the training set, we conducted another experiment with the training and test sets swapped, i.e., image sets C and D for training and A and B for testing. The performance is almost the same as in Fig. 3, so the proposed method appears to be relatively insensitive to a change of training set.









Fig. 4. Identical Images in Test Sets 



We then review the detected identical images. Figure 4 (a)-(c) are example images successfully recognized as identical image pairs. In particular, (a) is an opening image of a news program, (b) is a weather forecast CG image from the news, and (c) is an opening image of a documentary. Since video segments identical to these appear every day, images similar to them are included in both the training and test sets as positive sample pairs. These video segments can be regarded as “punctuation,” so the results can be used effectively for video parsing and structuring. On the other hand, Fig. 4 (d) shows an image for which an identical pair is included in the positive samples of the test set, and which is therefore expected to be detected, but which the method fails to detect. The image comes from an explanatory video segment that is shared by news topics covering related subjects broadcast on different days. The major reason for the failure is that image pairs identical to this image are included in neither the positive nor the negative samples of the training sets. Statistically, such video segments are negligible, and thus the generalized histogram is statistically successful, but it may unfortunately fail to detect extremely rare but interesting multiplicity. In contrast, detection of punctuation segments, as well as of commercial films, can be achieved effectively by the proposed method, because the distribution in the image space of possible multiplicity images is known beforehand.

We then analyze the generator matrix $G$ to inspect the mechanism of the generalized histogram. $g_{ij} = (G)_{ij}$ corresponds to the weight coefficient for the $j$-th element $h_j$ of the primitive histograms, and $[g_{i1} \; g_{i2} \ldots g_{iN}]$ can be regarded as a vector having the same length as the primitive histograms. In calculating the primitive histograms, we use the regular tessellation (8 × 8 × 16) of the $(x, y, v')$ space to obtain 3-D histograms $h_{pqr}$, and then convert them into a vector by a particular ordering. We thus apply the ordering inversely to $[g_{i1} \; g_{i2} \ldots g_{iN}]$ to obtain a 3-D tensor $(G^i)_{p,q,r} = g^i_{pqr} = g_{ij}$, where $p$, $q$, and $r$ are the indices inversely mapped from $j$. In particular, let $(p, q)$ correspond to the location of a region in the $(x, y)$ space, and $r$ correspond to the index of the range of normalized intensity. We then define the contribution of the block at $(p, q)$ in the $i$-th component as the variance of the coefficients $g^i_{pqr}$:

$$\Gamma^i_{p,q} = \mathcal{E}\left\{ \left( g^i_{pqr} - \mathcal{E}\{ g^i_{pqr} \mid p, q, i \} \right)^2 \,\middle|\, p, q, i \right\}.$$

$\Gamma^i_{p,q}$ can be visualized as an image in the $(p, q)$ space. In addition, $g^i_{pqr}$ can be regarded as weight coefficients for intensity histograms when we fix $p$, $q$, and $i$. They can be plotted as the weight coefficients at the block $(p, q)$ in the $i$-th component. We then visualize the first, second, and third components of the generator matrices obtained with and without sample expansion in Fig. 5. For each component, the contribution of each block is shown as the brightness of the corresponding block; a brighter block represents a larger contribution. In addition, the weight coefficients of three blocks, indicated as A, B, and C, are shown in the graphs. With sample expansion (Fig. 5 (a)-(c)), block A has the highest contribution for the first component, and the corresponding graph shows relatively high variation. On the other hand, C has very low contribution in all three components, and the corresponding graphs are relatively flat. Without sample expansion (Fig. 5 (d)-(f)), all three components concentrate on the same block, namely A, possibly due to overfitting. The weight coefficient graphs are all jagged, disregarding the high correlations between neighboring bins. In contrast, in Fig. 5 (a)-(c) the weight coefficient graphs are stable. The effect of sample expansion is thus shown visually.

Fig. 5. Visualization of Generator Matrix. (a)-(c): with sample expansion, (d)-(f): without sample expansion
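
The block contribution defined above is simply a per-block variance over the intensity index, which the following sketch illustrates. The text only says that the 3-D histogram is vectorized "by a particular ordering", so the ordering used here (intensity index $r$ varying fastest, then $q$, then $p$) is an assumption of this example, as is the function name `block_contribution`.

```python
def block_contribution(g_row, nx=8, ny=8, nv=16):
    """Variance over the intensity index r of one generator-matrix row, per block (p, q).

    g_row: the nx * ny * nv weight coefficients of one component i.
    Returns an nx x ny nested list that can be displayed as an image.
    """
    if len(g_row) != nx * ny * nv:
        raise ValueError("row length does not match the tessellation")
    out = [[0.0] * ny for _ in range(nx)]
    for p in range(nx):
        for q in range(ny):
            start = (p * ny + q) * nv      # assumed ordering: r varies fastest
            coeffs = g_row[start:start + nv]
            mean = sum(coeffs) / nv
            out[p][q] = sum((c - mean) ** 2 for c in coeffs) / nv
    return out


if __name__ == "__main__":
    row = [0.0] * 1024
    row[304:320] = [float(r) for r in range(16)]   # only block (2, 3) varies with r
    contribution = block_contribution(row)
    print(contribution[2][3], contribution[0][0])
```

A block whose 16 coefficients are nearly constant contributes almost nothing under this measure, which corresponds to the flat graphs described for block C.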

As for processing time, it takes about 4 minutes to train the generalized histogram generator matrix, and 12 minutes to convert 10,000 frames in an MPEG-1 file into generalized histograms (including MPEG decoding). Given the converted generalized histograms, it takes 4.4 seconds for dirty filtering in matching two sets of 10,000 frames, and 8 minutes for cleansing by NCC when precision is about 90%. We use a tree-type data structure, the SR-tree [18], for range search instead of a linear scan to accelerate dirty filtering. The experiments were conducted on an ordinary PC (Pentium 2GHz). The proposed method thus achieves tractable computation time for identical video segment detection.

5 Conclusions 

We propose the generalized histogram as a low-dimensional representation of an image for efficient and precise image matching. Among techniques that enhance histogram-based matching, we inspect adaptive binning, subregion histograms, and adaptive weighting, and show that these techniques can be realized effectively as a linear conversion of high-dimensional primitive histograms. A linear learning algorithm that derives generalized histograms is introduced to take advantage of these enhancement techniques. A sample expansion technique is also introduced to circumvent the overfitting problem caused by high dimensionality and insufficient sample size. The effectiveness of the generalized histogram and of sample expansion is demonstrated by experiments in detecting multiplicity in sets of 20,000 randomly chosen images taken from television broadcasts. Matching with generalized histograms achieves almost the same performance as precise matching with NCC, but may miss rare multiplicity that does not appear in the training sets. This point should be investigated in future work. Incorporation of high-dimensional indexing techniques, as well as extension to identical shot discovery, should also be addressed. We also think that the generalized histogram may be useful as a learning framework for appearance-based vision tasks.



References 

1. Cheung, S.-S., Zakhor, A.: Video similarity detection with video signature clustering. In: 
Proc. of International Conference on Image Processing. (2001) 649-652

2. Hampapur, A., Bolle, R.M.: Comparison of distance measures for video copy detection. In: 
Proc. of ICME. (2001) 

3. Jaimes, A., Chang, S., Loui, A.C.: Duplicate detection in consumer photography and news 
video. In: Proc. of ACM Multimedia. (2002) 423-424 







4. Satoh, S.: News video analysis based on identical shot detection. In: Proc. of International 
Conference on Multimedia and Expo. (2002) 

5. Zaniolo, C., Ceri, S., Faloutsos, C., Snodgrass, R.T., Subrahmanian, Y.S., Zicari, R.: Advanced 
Database Systems. Morgan Kaufmann (1997) 

6. Bohm, C., Berchtold, S., Keim, D.A.: Searching in high-dimensional spaces: Index structures 
for improving the performance of multimedia databases. ACM Computing Surveys 33 (2001) 
322-373 

7. Flickner, M., et al.: Query by image and video content: The QBIC system. IEEE Computer 
(1995) 23-32 

8. Vinod, V.V., Murase, H.: Focused color intersection with efficient searching for object ex- 
traction. Pattern Recognition 30 (1997) 1787-1797 

9. Swain, M., Ballard, D.: Color indexing. International Journal on Computer Vision 7 (1991) 
11-32 

10. Sclaroff, S., Taycher, L., Cascia, M.L.: ImageRover: A content-based image browser for 
the world wide web. In: Proc. of Workshop on Content-Based Access of Image and Video 
Libraries. (1997) 2-9 

11. Puzicha, J., Buhmann, J.M., Rubner, Y., Tomasi, C.: Empirical evaluation of dissimilarity 
measures for color and texture. In: Proc. of International Conference on Computer Vision. 
(1999) 1165-1173 

12. Leow, W.K., Li, R.: Adaptive binning and dissimilarity measure for image retrieval and 
classification. In: Proc. of Computer Vision and Pattern Recognition. Volume II. (2001) 
234-239 

13. Rui, Y., Huang, T.S., Ortega, H., Mehrotra, S.: Relevance feedback: A power tool for interactive
content-based image retrieval. IEEE Trans. on Circuits and Systems for Video Technology 8
(1998) 644-655

14. Hafner, J., Sawhney, H., Equitz, W., Flickner, M., Niblack, W.: Efficient color histogram 
indexing for quadratic form distance functions. IEEE Trans. on Pattern Analysis and Machine
Intelligence 17 (1995) 729-736 

15. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. John Wiley & Sons, Inc. (2001) 

16. Fukunaga, K.: Introduction to Statistical Pattern Recognition (2nd ed.). Academic Press 
(1990) 

17. Yamagishi, F., Satoh, S., Hamada, T., Sakauchi, M.: Identical video segment detection for 
large-scale broadcast video archives. In: Proc. of International Workshop on Content-Based 
Multimedia Indexing (CBMI’03). (2003) 135-142 

18. Katayama, N., Satoh, S.: The SR-tree: An index structure for high-dimensional nearest 
neighbor queries. In: Proc. ACM SIGMOD. (1997) 369-380 




Recognizing Objects in Range Data Using 
Regional Point Descriptors 



Andrea Frome¹, Daniel Huber², Ravi Kolluri¹,
Thomas Bülow¹*, and Jitendra Malik¹

¹ University of California Berkeley, Berkeley CA 94530, USA,

{afrome,rkolluri,malik}@cs.berkeley.edu
thomas.buelow@philips.com
² Carnegie Mellon University, Pittsburgh PA 15213, USA
dhuber@cs.cmu.edu



Abstract. Recognition of three dimensional (3D) objects in noisy and 
cluttered scenes is a challenging problem in 3D computer vision. One 
approach that has been successful in past research is the regional shape 
descriptor. In this paper, we introduce two new regional shape descrip- 
tors: 3D shape contexts and harmonic shape contexts. We evaluate the 
performance of these descriptors on the task of recognizing vehicles in 
range scans of scenes using a database of 56 cars. We compare the two 
novel descriptors to an existing descriptor, the spin image, showing that 
the shape context based descriptors have a higher recognition rate on 
noisy scenes and that 3D shape contexts outperform the others on clut- 
tered scenes. 



1 Introduction 

Recognition of three dimensional (3D) objects in noisy and cluttered scenes is a 
challenging problem in 3D computer vision. Given a 3D point cloud produced by 
a range scanner observing a 3D scene (Fig. 1), the goal is to identify objects in 
the scene (in this case, vehicles) by comparing them to a set of candidate objects. 
This problem is challenging for several reasons. First, in range scans, much of the 
target object is obscured due to self-occlusion or is occluded by other objects. 
Nearby objects can also act as background clutter, which can interfere with the 
recognition process. Second, many classes of objects, for example the vehicles in 
our experiments, are very similar in shape and size. Third, range scanners have 
limited spatial resolution; the surface is only sampled at discrete points, and fine 
detail in the objects is usually lost or blurred. Finally, high-speed range scanners 
(e.g., flash ladars) introduce significant noise in the range measurement, making 
it nearly impossible to manually identify objects. 

Object recognition in such a setting is interesting in its own right, but would also be useful in applications such as scan registration [9] [6] and robot localization. The ability to recognize objects in 2 1/2-D images such as range scans may also prove valuable in recognizing objects in 2D images when some depth information can be inferred from cues such as shading or motion.

Fig. 1. (a) An example of a cluttered scene containing trees, a house, the ground, and a vehicle to be recognized. (b) A point cloud generated from a scan simulation of the scene. Notice that the range shadow of the building occludes the front half of the vehicle.

T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3023, pp. 224-237, 2004.

Many approaches to 3D object recognition have been put forth, including gen- 
eralized cylinders [3], superquadrics [7], geons [23], medial axis representations 
[1], skeletons [4], shape distributions [19], and spherical harmonic representa- 
tions of global shape [8]. Many of these methods require that the target be seg- 
mented from the background, which makes them difficult to apply to real-world 
3D scenes. Furthermore, many global methods have difficulty leveraging subtle 
shape variations, especially with large parts of the shape missing from the scene. 
At the other end of the spectrum, purely local descriptors, such as surface cur- 
vature, are well-known for being unstable when faced with noisy data. Regional 
point descriptors lie midway between the global and local approaches, giving 
them the advantages of both. This is the approach that we follow in this paper. 

Methods which use regional point descriptors have proven successful in the 
context of image-based recognition [17] [15] [2] as well as 3D recognition and sur- 
face matching [22] [13] [5] [21]. A regional point descriptor characterizes some prop- 
erty of the scene in a local support region surrounding a basis point. In our case, 
the descriptors characterize regional surface shape. Ideally, a descriptor should 
be invariant to transformations of the target object (e.g., rotation and trans- 
lation in 3D) and robust to noise and clutter. The descriptor for a basis point 
located on the target object in the scene will, therefore, be similar to the de- 
scriptor for the corresponding point on a model of the target object. These 
model descriptors can be stored in a pre-computed database and accessed using 
fast nearest-neighbor search methods such as locality-sensitive hashing [11]. The 
limited support region of descriptors makes them robust to significant levels of 
occlusion. Reliable recognition is made possible by combining the results from 
multiple basis points distributed across the scene. 

In this paper we make the following contributions: (1) we develop the 3D generalization of the 2D shape context descriptor, (2) we introduce the harmonic shape context descriptor, (3) we systematically compare the performance of the 3D shape context, harmonic shape context, and spin images in recognizing similar objects in scenes with noise or clutter. We also briefly examine the trade-off of applying hashing techniques to speed search over a large set of objects.

The organization of the paper is as follows: in section 2, we introduce the 
3D shape context and harmonic shape context descriptors and review the spin 
image descriptor. Section 3 describes the representative descriptor method for 
aggregating distances between point descriptors to give an overall matching score 
between a query scene and model. Our data set is introduced in section 4, and 
our experiments and results are presented in section 5. We finish in section 6 
with a brief analysis of a method for speeding our matching process. 

2 Descriptors 

In this section, we provide the details of the new 3D shape context and harmonic 
shape context descriptors and review the existing spin-image descriptor. All three 
descriptors take as input a point cloud V and a basis point p, and capture the 
regional shape of the scene at p using the distribution of points in a support 
region surrounding p. The support region is discretized into bins, and a histogram 
is formed by counting the number of points falling within each bin. For the 3D 
shape contexts and spin-images, this histogram is used directly as the descriptor, 
while with harmonic shape contexts, an additional transformation is applied. 

When designing such a 3D descriptor, the first two decisions to be made 
are (1) what is the shape of the support region and (2) how to map the bins 
in 3D space to positions in the histogram vector. All three methods address 
the second issue by aligning the support region’s “up” or north pole direction 
with an estimate of the surface normal at the basis point, which leaves a degree 
of freedom along the azimuth. Their differences arise from the shape of their 
support region and how they remove this degree of freedom. 



2.1 3D Shape Contexts 

The 3D shape context is the straightforward extension of 2D shape contexts, introduced by Belongie et al. [2], to three dimensions. The support region for a 3D shape context is a sphere centered on the basis point $p$ with its north pole oriented along the surface normal estimate $N$ for $p$ (Fig. 2). The support region is divided into bins by equally spaced boundaries in the azimuth and elevation dimensions and logarithmically spaced boundaries along the radial dimension. We denote the $J + 1$ radial divisions by $R = \{R_0 \ldots R_J\}$, the $K + 1$ elevation divisions by $\Theta = \{\Theta_0 \ldots \Theta_K\}$, and the $L + 1$ azimuth divisions by $\Phi = \{\Phi_0 \ldots \Phi_L\}$. Each bin corresponds to one element in the $J \times K \times L$ feature vector. The first radius division $R_0$ is the minimum radius $r_{min}$, and $R_J$ is the maximum radius $r_{max}$. The radius boundaries are calculated as

$$R_j = \exp\left\{ \ln(r_{min}) + \frac{j}{J} \ln\left(\frac{r_{max}}{r_{min}}\right) \right\}. \tag{1}$$

Sampling logarithmically makes the descriptor more robust to distortions in shape with distance from the basis point. Bins closer to the center are smaller in all three spherical dimensions, so we use a minimum radius ($r_{min} > 0$) to avoid being overly sensitive to small differences in shape very close to the center. The $\Theta$ and $\Phi$ divisions are evenly spaced along the 180° and 360° elevation and azimuth ranges.
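
As a small illustration of the logarithmic radial spacing, the sketch below computes the $J + 1$ boundaries that interpolate between $r_{min}$ and $r_{max}$ in log space. The function name `radial_boundaries` and the numeric values in the usage lines are hypothetical and chosen only for the example.

```python
import math


def radial_boundaries(r_min, r_max, J):
    """Return the J + 1 logarithmically spaced radii R_0 .. R_J."""
    if r_min <= 0 or r_max <= r_min:
        raise ValueError("need 0 < r_min < r_max")
    log_ratio = math.log(r_max / r_min)
    return [math.exp(math.log(r_min) + (j / J) * log_ratio) for j in range(J + 1)]


if __name__ == "__main__":
    radii = radial_boundaries(r_min=0.1, r_max=2.5, J=5)   # hypothetical values
    print([round(r, 4) for r in radii])
```

Consecutive boundaries have a constant ratio, so the shells grow thicker with distance from the basis point, which is the behavior the paragraph above relies on.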

Bin$(j, k, l)$ accumulates a weighted count $w(p_i)$ for each point $p_i$ whose spherical coordinates relative to $p$ fall within the radius interval $[R_j, R_{j+1})$, elevation interval $[\Theta_k, \Theta_{k+1})$, and azimuth interval $[\Phi_l, \Phi_{l+1})$. The contribution to the bin count for point $p_i$ is given by

$$w(p_i) = \frac{1}{\rho_i \sqrt[3]{V(j, k, l)}} \tag{2}$$

where $V(j, k, l)$ is the volume of the bin and $\rho_i$ is the local point density around the bin. Normalizing by the bin volume compensates for the large variation in bin sizes with radius and elevation. We found empirically that using the cube root of the volume retains significant discriminative power while leaving the descriptor robust to noise which causes points to cross over bin boundaries. The local point density $\rho_i$ is estimated as the count of points in a sphere of radius $\delta$ around $p_i$. This normalization accounts for variations in sampling density due to the angle of the surface or distance to the scanner.
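
The weighting can be sketched with the standard volume formula for a spherical-coordinate cell. In this illustration the elevation is taken as the polar angle measured from the north pole, in radians, which is an assumption of the example; the function names `bin_volume` and `point_weight` are hypothetical.

```python
import math


def bin_volume(r0, r1, theta0, theta1, phi0, phi1):
    """Volume of the cell r0 <= r < r1, theta0 <= theta < theta1, phi0 <= phi < phi1.

    theta is the polar angle from the north pole and phi the azimuth, both in radians.
    """
    return ((r1 ** 3 - r0 ** 3) / 3.0
            * (math.cos(theta0) - math.cos(theta1))
            * (phi1 - phi0))


def point_weight(density, volume):
    """w(p_i) = 1 / (rho_i * cube root of V(j, k, l))."""
    return 1.0 / (density * volume ** (1.0 / 3.0))


if __name__ == "__main__":
    v = bin_volume(0.5, 1.0, 0.0, math.pi / 4, 0.0, math.pi / 6)
    print(v, point_weight(density=12.0, volume=v))
```

Integrating over the full ranges of all three coordinates recovers the volume of a sphere, which is a convenient sanity check on the formula.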

We have a degree of freedom in the azimuth direction that we must remove in order to compare shape contexts calculated in different coordinate systems. To account for this, we choose some direction to be $\Phi_0$ in an initial shape context, and then rotate the shape context about its north pole into $L$ positions, such that each $\Phi_l$ division is located at the original 0° position in one of the rotations. For descriptor data sets derived from our reference scans, $L$ rotations for each basis point are included, whereas in the query data sets, we include only one position per basis point.
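
Because the azimuth divisions are evenly spaced, rotating a shape context about its north pole by one division amounts to a circular shift of the azimuth index. The sketch below illustrates this under the assumption that a descriptor is stored as a nested list `sc[j][k][l]` (radius, elevation, azimuth); the storage layout and the function names are hypothetical.

```python
import math


def azimuth_rotations(sc):
    """Return the L circular shifts of sc[j][k][l] along the azimuth index l."""
    L = len(sc[0][0])
    return [[[row[s:] + row[:s] for row in shell] for shell in sc] for s in range(L)]


def flatten(sc):
    """Flatten a nested J x K x L descriptor into a feature vector."""
    return [v for shell in sc for row in shell for v in row]


def min_l2_over_rotations(query, reference):
    """Smallest Euclidean distance between the query and any rotation of the reference."""
    q = flatten(query)
    return min(math.dist(q, flatten(rot)) for rot in azimuth_rotations(reference))


if __name__ == "__main__":
    ref = [[[1, 2, 3, 4], [5, 6, 7, 8]]]          # toy descriptor: J=1, K=2, L=4
    query = [[[4, 1, 2, 3], [8, 5, 6, 7]]]        # same shape seen from another azimuth
    print(min_l2_over_rotations(query, ref))
```

Storing all $L$ rotations on the reference side keeps query-time work low, at the cost of an $L$-fold larger reference set.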




Fig. 2. Visualization of the histogram bins of the 3D shape context.



2.2 Harmonic Shape Contexts 

To compute harmonic shape contexts, we begin with the histogram described 
above for 3D shape contexts, but we use the bin values as samples to calculate 
a spherical harmonic transformation for the shells and discard the original his- 
togram. The descriptor is a vector of the amplitudes of the transformation, which 
are rotationally invariant in the azimuth direction, thus removing the degree of 
freedom. 

Any real function $f(\theta, \phi)$ can be expressed as a sum of complex spherical harmonic basis functions $Y_l^m$:

$$f(\theta, \phi) = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} A_l^m Y_l^m(\theta, \phi) \tag{3}$$

A key property of this harmonic transformation is that a rotation in the azimuthal direction results in a phase shift in the frequency domain, and hence the amplitudes of the harmonic coefficients $\|A_l^m\|$ are invariant to rotations in the azimuth direction. We translate a 3D shape context into a harmonic shape context by defining a function $f_j(\theta, \phi)$ based on the bins of the 3D shape context in a single spherical shell $R_j < R \le R_{j+1}$ as:

$$f_j(\theta, \phi) = SC(j, k, l), \quad \theta_k < \theta \le \theta_{k+1}, \; \phi_l < \phi \le \phi_{l+1}. \tag{4}$$

As in [14], we choose a bandwidth $b$ and store only the $b$ lowest-frequency components of the harmonic representation in our descriptor, which is given by $HSC(l, m, k) = \|A^m_{l,k}\|$, $l, m = 0 \ldots b$, $k = 0 \ldots K$. For any real function, $\|A_l^m\| = \|A_l^{-m}\|$, so we drop the coefficients $A_l^m$ for $m < 0$. The dimensionality of the resulting harmonic shape context is $K \cdot b(b + 1)/2$. Note that the number of azimuth and elevation divisions does not affect the dimensionality of the descriptor.
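
The phase-shift property can be checked numerically on the azimuthal factor alone. The sketch below is not a spherical harmonic transform: it only uses the fact that every basis function $Y_l^m$ depends on azimuth through $e^{im\phi}$, so for a single ring of azimuth bins the amplitudes of a discrete Fourier transform already show the invariance. The function name `azimuth_amplitudes` and the sample values are hypothetical.

```python
import cmath
import math


def azimuth_amplitudes(ring, n_freq):
    """Amplitudes |sum_l f_l * exp(-i m phi_l)| for m = 0 .. n_freq - 1.

    ring: bin values at L evenly spaced azimuths phi_l = 2 * pi * l / L.
    """
    L = len(ring)
    return [
        abs(sum(f * cmath.exp(-2j * math.pi * m * l / L) for l, f in enumerate(ring)))
        for m in range(n_freq)
    ]


if __name__ == "__main__":
    ring = [3, 1, 4, 1, 5, 9, 2, 6]            # hypothetical bin counts
    rotated = ring[3:] + ring[:3]              # same ring, rotated by three divisions
    print([round(a, 6) for a in azimuth_amplitudes(ring, 4)])
    print([round(a, 6) for a in azimuth_amplitudes(rotated, 4)])
```

Both printed lists agree, since the circular shift changes only the phase of each coefficient. The full descriptor additionally expands the elevation dependence, which this example leaves out.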

Harmonic shape contexts are related to the rotation-invariant shape descriptors $SH(f)$ described in [14]. One difference between those and the harmonic shape contexts is that one $SH(f)$ descriptor is used to describe the global shape of a single object. Also, the shape descriptor $SH(f)$ is a vector of length $b$ whose components are the energies of the function $f$ in the $b$ lowest frequencies: $SH_l(f) = \| \sum_{m=-l}^{l} A_l^m Y_l^m \|$. In contrast, harmonic shape contexts retain the amplitudes of the individual frequency components, and, as a result, are more descriptive.



2.3 Spin Images 

We compared the performance of both of these shape context-based descriptors 
to spin images [13]. Spin-images are well-known 3D shape descriptors that have 
proven useful for object recognition [13], classification [20], and modeling [10]. 
Although spin-images were originally defined for surfaces, the adaptation to 
point clouds is straightforward. The support region of a spin image at a basis 
point p is a cylinder of radius r max and height h centered on p with its axis 
aligned with the surface normal at p. The support region is divided linearly into 
J segments radially and K segments vertically, forming a set of J x K rings. 
The spin-image for a basis point p is computed by counting the points that fall 
within each ring, forming a 2D histogram. As with the other descriptors, the
contribution of each point $q_i$ is weighted by the inverse of that point’s density
estimate ($\rho_i$); however, the bins are not weighted by volume. Summing within
each ring eliminates the degree of freedom along the azimuth, making spin- 
images rotationally invariant. We treat a spin-image as a J x K dimensional 
feature vector. 
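
The spin-image construction can be sketched in a few lines. This is a minimal illustration, not the implementation used in the experiments: points are plain 3-tuples, the optional `weights` argument stands in for the inverse density estimates, and the function name `spin_image` and the toy values are hypothetical.

```python
import math


def spin_image(points, p, normal, r_max, height, J, K, weights=None):
    """2D histogram (J radial x K vertical rings) of points around basis point p."""
    norm = math.sqrt(sum(c * c for c in normal))
    n = [c / norm for c in normal]
    hist = [[0.0] * K for _ in range(J)]
    for idx, q in enumerate(points):
        d = [q[a] - p[a] for a in range(3)]
        beta = sum(d[a] * n[a] for a in range(3))                 # signed height along the normal
        alpha = math.sqrt(max(sum(c * c for c in d) - beta * beta, 0.0))  # distance from the axis
        if alpha >= r_max or abs(beta) >= height / 2.0:
            continue                                              # outside the cylinder
        j = min(int(alpha / r_max * J), J - 1)
        k = min(int((beta + height / 2.0) / height * K), K - 1)
        hist[j][k] += 1.0 if weights is None else weights[idx]
    return hist


if __name__ == "__main__":
    pts = [(0.9, 0.0, 0.2), (0.0, 0.9, 0.2), (-0.9, 0.0, 0.2), (3.0, 0.0, 0.0)]
    image = spin_image(pts, p=(0, 0, 0), normal=(0, 0, 1), r_max=2.0, height=2.0, J=4, K=4)
    for row in image:
        print(row)
```

The first three points differ only in azimuth and land in the same ring, which is the sense in which summing within each ring removes the azimuthal degree of freedom.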







3 Using Point Descriptors for Recognition 

To compare two descriptors of the same type to one another, we use some measure of distance between the feature vectors: $\ell_2$ distance for 3D shape contexts and spin images, and the inverse of the normalized correlation for harmonic shape contexts. Given a query scene $S_q$ and a set of reference descriptors calculated from scans of known models, we would like to choose the known model which is most similar to an object in $S_q$. After we calculate descriptors from $S_q$ and distances between the query descriptors and reference descriptors, we face the problem of how to aggregate these distances to make a choice as to which model is the best match to $S_q$.

A straightforward way of doing this would be to have every descriptor from 
S q vote for the model that gave the closest descriptor, and choose the model with 
the most votes as the best match. The problem is that in placing a hard vote, 
we discard the relative distances between descriptors which provide information 
about the quality of the matches. To remedy this, we use the representative 
shape context method introduced in Mori et al. [18], which we refer to as the 
representative descriptor method , since we also apply it to spin images. 

3.1 Representative Descriptor Method 

We precompute M descriptors at points p_1, ..., p_M for each reference scan S_i, and
compute at query time K descriptors at points q_1, ..., q_K from the query scene
S_q, where K ≪ M. We call these K points representative descriptors (RDs). For
each of the query points q_k and each reference scan S_i, we find the descriptor
p_m computed from S_i that has the smallest ℓ2 distance to q_k. We then sum the
distances found for each q_k, and call this the representative descriptor cost of
matching S_q to S_i:

\[
\mathrm{cost}(S_q, S_i) = \sum_{k \in \{1,\ldots,K\}} \min_{m \in \{1,\ldots,M\}} \mathrm{dist}(q_k, p_m) \qquad (5)
\]

The best match is the reference model S that minimizes this cost. 
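
The cost in equation (5) and the choice of the minimizing model can be sketched as follows. This is a minimal brute-force version under stated assumptions: descriptors are plain lists of floats, `l2`, `rd_cost`, and `best_match` are illustrative names, and the ℓ2 distance is used for every descriptor type.

```python
import math
from typing import Dict, List, Sequence

Descriptor = Sequence[float]


def l2(a: Descriptor, b: Descriptor) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rd_cost(query_descs: Sequence[Descriptor],
            ref_descs: Sequence[Descriptor]) -> float:
    """Equation (5): sum over query descriptors of the nearest reference distance."""
    return sum(min(l2(q, p) for p in ref_descs) for q in query_descs)


def best_match(query_descs: Sequence[Descriptor],
               references: Dict[str, List[Descriptor]]) -> str:
    """Return the name of the reference model with the smallest RD cost."""
    return min(references, key=lambda name: rd_cost(query_descs, references[name]))
```

Unlike hard voting, the summed distances keep the information about how good each individual match was.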

Scoring matches solely on the representative descriptor costs can be thought 
of as a lower bound on an ideal cost measure that takes geometric constraints 
between points into account. We show empirically that recognition performance 
using just these costs is remarkably good even without a more sophisticated 
analysis of the matches. 

One could select the center points for the representative descriptors using 
some criteria, for example by picking out points near which the 3D structure 
is interesting. For purposes of this paper, we sidestep that question altogether 
and choose our basis points randomly. To be sure that we are representing the 
performance of the algorithm, we performed each representative descriptor ex- 
periment 100 times with different random subsets of basis points. For each run 
we get a recognition rate that is the percentage of the 56 query scenes that we 
correctly identified using the above method. The mean recognition rate is the 
recognition rate averaged across runs. 

Fig. 3. The 56 car models used in our experiments shown to scale.



4 Data Set 

We tested our descriptors on a set of 56 3D models of passenger vehicles taken
from the De Espona 3D model library [12] and rescaled to their actual sizes
(Fig. 3).¹ The point clouds used in our experiments were generated using a laser
sensor simulator that emulates a non-commercial airborne range scanner system.
We have shown in separate experiments that these descriptors work well for real
data, but for these experiments, our goal was to compare the performance of the
descriptors in controlled circumstances.

¹ The Princeton Shape Benchmark, a data set with 1,814 3D models, was recently
released. We didn’t learn of the data set in time to use it in this paper,
but we will be using it in future experiments. It can be found online at
http://shape.cs.princeton.edu/benchmark/.

We generated two types of point clouds: a set of model or “reference” scans,
and several sets of scene or “query” scans. For each vehicle, we generated four
reference scans with the sensor positioned at 90° azimuth intervals (φ = 45°,
135°, 225°, and 315°), a 45° declination angle, and a range of 500 m from the
target. The resulting point clouds contained an average of 1,990 target points
spaced approximately 6 cm apart. The query scans were generated in a similar
manner, except that the declination was 30° and the azimuth was at least 15°
different from the nearest reference scan. Depending on the experiment, either
clutter and occlusion or noise was added. Clutter and occlusion were generated
by placing the model in a test scene consisting of a building, overhanging trees,
and a ground plane (Fig. 1(a)). The point clouds for these scenes contained an
average of 60,650 points. Noisy scans were modeled by adding Gaussian noise
(N(0, σ)) along the line of sight of each point.
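
To make the noise model concrete, the sketch below displaces each point along the ray from the sensor by a zero-mean Gaussian offset with standard deviation sigma. The function name `add_range_noise`, the explicit sensor position, and the use of Python's `random.Random` generator are assumptions for illustration; the simulator used in the paper is not described in that detail here.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def add_range_noise(points: Sequence[Point], sensor: Point, sigma: float,
                    rng: random.Random) -> List[Point]:
    """Shift every point along its line of sight by an N(0, sigma) offset."""
    noisy = []
    for q in points:
        ray = [q[i] - sensor[i] for i in range(3)]
        length = math.sqrt(sum(c * c for c in ray))   # range to the point (assumed > 0)
        offset = rng.gauss(0.0, sigma)                # noise in the range measurement
        scale = (length + offset) / length
        noisy.append(tuple(sensor[i] + ray[i] * scale for i in range(3)))
    return noisy
```

A perturbed point stays on the original ray, so the noise changes only the measured range and not the viewing direction.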






Fig. 4. The top row shows scans from the 1962 Ferrari 250 model, and the bottom scans
are from the Dodge Viper. The scans in column (a) are the query scans at 30° elevation
and 15° azimuth with σ = 5 cm noise, and those in (b) are from the same angle but
with σ = 10 cm noise. With 10 cm noise, it is difficult to differentiate the vehicles by
looking at the 2D images of the point clouds. Column (c) shows the reference scans
closest in viewing direction to the query scans (45° azimuth and 45° elevation). In the
5 cm and 10 cm noise experiments, we first chose 300 candidate basis points and sampled
RDs from those.



Basis points for the descriptors in the reference point clouds were selected 
using a method that ensures approximately uniform sampling over the model’s 
visible surface. Each point cloud was divided into 0.2-meter voxels and one point 
was selected at random from each occupied voxel, giving an average of 373 de- 
scriptors per point cloud (1,494 descriptors per model). Basis points in the query 
point clouds were chosen using the same method, except that the set was further
reduced by selecting a random subset of N basis points (N = 300 for the
clutter-free queries and N = 2000 for the clutter queries) from which representative
descriptors were chosen. For a given experiment, the same subset of basis
points was used in generating the three types of descriptors. After noise and
clutter were added, normals for the basis points were computed using a method 
which preserves discontinuities in the shape and that accounts for noise along the 
viewing direction [16]. The algorithm uses points within a cube-shaped window 
around the basis point for the estimation, where the size of the window can be 
chosen based on the expected noise level. 
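
The voxel-based selection of basis points can be sketched as below. The name `sample_basis_points`, the dictionary-of-lists voxel grid, and the integer voxel keys obtained with `math.floor` are assumptions for this example rather than the authors' code.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def sample_basis_points(points: Sequence[Point], voxel_size: float,
                        rng: random.Random) -> List[Point]:
    """Pick one point at random from every occupied voxel of the given size."""
    voxels: Dict[Tuple[int, int, int], List[Point]] = defaultdict(list)
    for q in points:
        key = tuple(math.floor(c / voxel_size) for c in q)  # voxel containing q
        voxels[key].append(q)
    return [rng.choice(members) for members in voxels.values()]
```

For the query clouds, a further random subset of N candidates could be drawn with `rng.sample(basis, min(N, len(basis)))`, with `voxel_size=0.2` matching the 0.2-meter voxels in the text.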

5 Experiments

The parameters for the descriptors (Table 1) were chosen based on extensive
experimentation on other sets of 3D models not used in these experiments.
However, some parameters (specifically K and r_min) were fine-tuned using
descriptors in 20 randomly selected models from our 56 vehicle database. The
basis points used for training were independent from those used in testing. The
relative scale of the support regions was chosen to make the volume encompassed
comparable across descriptors.



Table 1. Parameters used in the experiments for shape contexts (SC), harmonic
shape contexts (HSC), and spin images (SI). All distances are in meters.

                            SC      HSC     SI
max radius (r_max)          2.5     2.5     2.5
min radius (r_min)          0.1     0.1     -
height (h)                  -       -       2.5
radial divisions (J)        15      15      15
elev./ht. divisions (K)     11      11      15
azimuth divisions (L)       12      12      -
bandwidth (b)               -       16      -
dimensions                  1980    2040    225
density radius (δ)          0.2     0.2     0.2
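
The "dimensions" row of Table 1 can be checked against the other rows with a few lines of arithmetic. The SC and SI counts follow directly from the bin divisions; the HSC count below assumes that each of the J radial shells keeps one coefficient amplitude per pair (l, m) with 0 ≤ m ≤ l < b, which is one reading that reproduces the tabulated value and is stated here as an assumption.

```python
# Parameter values copied from Table 1.
J_SC, K_SC, L_SC = 15, 11, 12   # shape context: radial, elevation, azimuth divisions
J_HSC, B_HSC = 15, 16           # harmonic shape context: radial divisions, bandwidth
J_SI, K_SI = 15, 15             # spin image: radial and height divisions

sc_dims = J_SC * K_SC * L_SC                    # one bin per (radius, elevation, azimuth)
hsc_dims = J_HSC * (B_HSC * (B_HSC + 1) // 2)   # assumed: pairs (l, m), 0 <= m <= l < b
si_dims = J_SI * K_SI                           # one bin per (radius, height) ring

print(sc_dims, hsc_dims, si_dims)
```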



5.1 Scenes with 5 cm Noise 




Fig. 5. Results for the 5 cm noise experiment.
All three methods performed
roughly equally. From 300 basis points 
sampled evenly from the surface, we chose 
varying numbers of RDs, and recorded 
the mean recognition rate. The error bars 
show one standard deviation. 



In this set of experiments, our query data was a set of 56 scans, each containing 
one of the car models. We added Gaussian noise to the query scans along the 
scan viewing direction with a standard deviation of 5 cm (Fig. 4). The window 
for computing normals was a cube 55 cm on a side. Fig. 5 shows the mean 
recognition rate versus number of RDs. All of the descriptors perform roughly 
equally, achieving close to 100% average recognition with 40 RDs. 



5.2 Scenes with 10 cm Noise 

We performed two experiments with the standard deviation increased to 10 cm
(see Fig. 4). In the first experiment, our window size for computing normals was
the same as in the 5 cm experiments. The results in Fig. 6 show a significant
decrease in performance by all three descriptors, especially spin images. To test
how much the normals contributed to the decrease in recognition, we performed
a second experiment with a normal estimation window size of 105 cm, giving us
normals more robust to noise. The spin images showed the most improvement,
indicating their performance is more sensitive to the quality of the normals.

Fig. 6. Results for 10 cm noise experiments. In experiment (a) we used a window for
the normals that was a cube 55 cm on a side, whereas in (b) the size was increased to
a cube 105 cm on a side. The error bars show one standard deviation from the mean.
From this experiment, we see that shape contexts degrade less as we add noise and in
particular are less sensitive to the quality of the normals than spin images. All three
methods would benefit from tuning their parameters to the higher noise case, but this
would entail a recalculation of the reference set. In general, a method that is more
robust to changes in query conditions is preferable.



5.3 Cluttered Scenes 

To test the ability of the descriptors to handle a query scene containing substantial
clutter, we created scenes by placing each of the vehicle models in the clutter
scene shown in Fig. 1(a). We generated scans of each scene from a 30° declination
and two different azimuth angles (φ = 150° and φ = 300°), which we will
call views #1 and #2 (Fig. 7). We assume that the approximate location of the
target model is given in the form of a box-shaped volume of interest (VOI). The
VOI could be determined automatically by a generic object saliency algorithm,
but for the controlled experiments in this paper, we manually specified the VOI
to be a 2 m × 4 m × 6 m volume that contains the vehicle as well as some clutter,
including the ground plane (Fig. 7(b)). Basis points for the descriptors were chosen
from within this VOI, but for a given basis point, all the scene points within
the descriptor’s support region were used, including those outside of the VOI.

We ran separate experiments for views 1 and 2, using 80 RDs for each run. 
When calculating the representative descriptor cost for a given scene-model pair, 
we included in the sum in equation (5) only the 40 smallest distances between 
RDs and the reference descriptors for a given model. This acts as a form of outlier 
rejection, filtering out many of the basis points not located on the vehicle. We 
chose 40 because approximately half of the basis points in each VOI fell on a 
vehicle. The results are shown in Fig. 8. 
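
The outlier rejection described here amounts to truncating the sum in equation (5) to its smallest terms. The sketch below shows one way to write that; `truncated_rd_cost` and its `keep` parameter are illustrative names, and the ℓ2 distance is assumed.

```python
import math
from typing import Sequence

Descriptor = Sequence[float]


def l2(a: Descriptor, b: Descriptor) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def truncated_rd_cost(query_descs: Sequence[Descriptor],
                      ref_descs: Sequence[Descriptor], keep: int) -> float:
    """Sum only the `keep` smallest nearest-neighbor distances over the RDs."""
    nearest = sorted(min(l2(q, p) for p in ref_descs) for q in query_descs)
    return sum(nearest[:keep])
```

With 80 RDs and `keep=40`, descriptors centered on clutter, which tend to have large nearest-neighbor distances to every vehicle model, drop out of the cost.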

Fig. 7. The cluttered scene with the Karmann Ghia. Picture (a) is the scan from view
2, and (b) is a close-up of the VOI in view 1. For the fully-rendered scene and the full 
scan from view 1, refer to Fig. 1. The scanner in view 1 was located on the other side of 
the building from the car, causing the hood of the car to be mostly occluded. In view 
2, the scanner was on the other side of the trees, so the branches occlude large parts 
of the vehicle. There were about 100 basis points in the VOI in each query scene, and 
from those we randomly chose 80 representative descriptors for each run. 








Fig. 8. Cluttered scene results. In both, we included in the cost the 40 smallest dis- 
tances out of those calculated for 80 RDs. The graphs show recognition rate versus 
rank depth with error bars one standard deviation from the mean. We calculated the 
recognition rate based on the k best choices, where k is our rank depth (as opposed to 
considering only the best choice for each query scene). We computed the mean recog- 
nition rate as described before, but counted a match to a query scene as “correct” 
if the correct model was within the top k matches. Graph (a) shows the results for 
view #1 and (b) for view #2. Using the 3D shape context, we identified on average
78% of the 56 models correctly using the top 5 choices for each scene, but only 49%
of the models if we look at only the top choice for each. Spin images did not perform
as well; considering the top 5 matches, spin images achieved a mean recognition rate
of 56% and only 34% if only the top choice is considered. Harmonic shape contexts
performed particularly badly, achieving recognition slightly above chance. They chose
the largest vehicles as matches to almost all the queries.
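
The recognition rate at a given rank depth, as used in Fig. 8, can be computed along these lines. The function name and the data layout (one dictionary of model costs per query scene) are assumptions for illustration.

```python
from typing import Dict, Sequence


def recognition_rate_at_k(costs_per_query: Sequence[Dict[str, float]],
                          true_models: Sequence[str], k: int) -> float:
    """Fraction of queries whose true model is among the k lowest-cost models."""
    hits = 0
    for costs, truth in zip(costs_per_query, true_models):
        ranked = sorted(costs, key=costs.get)   # model names, best match first
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_models)
```

Setting `k=1` gives the usual top-choice recognition rate; averaging the result over repeated random choices of RDs gives the mean recognition rate.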

The shape context performance is impressive given that this is a result of
doing naive point-to-point matching without taking geometric constraints into 
account. Points on the ground plane were routinely confused for some of the car 
models which geometric constraints could rule out. A benefit of the 3D shape 
context over the other two descriptors is that a point-to-point match gives a 
candidate orientation of the model in the scene which can be used to verify 
other point matches. 



6 Speeding Search with Locality-Sensitive Hashing 






In this section, we briefly explore the cost of using 3D shape contexts and discuss 
a way to bring the amount of computation required for a 3D shape context query 
closer to what is used for spin images while maintaining accuracy. 

In the spin image and harmonic shape context experiments, we are comparing 
each of our representative descriptors to 83,640 reference descriptors. We must 
compare to the 12 rotations when using 3D shape contexts, giving a total of 
1,003,680. Our system implementation takes 7.4 seconds on a 2.2 GHz processor 
to perform the comparison of one 3D shape context to the reference set. 

Fast search techniques such as locality-sensitive hashing (LSH) [11] can reduce
the search space by orders of magnitude, making it more practical to
search over the 3D shape context rotations, though there is a tradeoff between
speed and accuracy of the nearest-neighbor result. The method divides the
high-dimensional feature space where the descriptors lie into hypercubes, divided
by a set of k randomly-chosen axis-parallel hyperplanes. These define a hash
function where points that lie in the same hypercube hash to the same value. The
greater the number of planes, the more likely that two neighbors will have
different hash values. The probability that two nearby points are separated is
reduced by independently choosing l different sets of hyperplanes, thus defining l
different hash functions. Given a query vector, the result is the set of hashed
vectors which share one of their l hash values with the query vector.
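
A toy version of this hashing scheme is sketched below. The class name `AxisParallelLSH`, the way hyperplanes are drawn (a random axis and a threshold uniform in an assumed value range [lo, hi]), and the bucket layout are assumptions for illustration; they are not taken from [11] or from the authors' implementation.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Set, Tuple


class AxisParallelLSH:
    """l hash tables, each keyed by the side of k axis-parallel hyperplanes."""

    def __init__(self, dim: int, k: int, l: int, lo: float, hi: float,
                 rng: random.Random) -> None:
        self.tables: List[Tuple[List[Tuple[int, float]], defaultdict]] = []
        for _ in range(l):
            # Each hyperplane is (axis index, threshold along that axis).
            planes = [(rng.randrange(dim), rng.uniform(lo, hi)) for _ in range(k)]
            self.tables.append((planes, defaultdict(list)))

    @staticmethod
    def _key(planes: Sequence[Tuple[int, float]], v: Sequence[float]) -> tuple:
        # Points in the same hypercube fall on the same side of every plane.
        return tuple(v[axis] > t for axis, t in planes)

    def add(self, ident: Hashable, v: Sequence[float]) -> None:
        for planes, buckets in self.tables:
            buckets[self._key(planes, v)].append(ident)

    def query(self, v: Sequence[float]) -> Set[Hashable]:
        """Identifiers sharing at least one of the l hash values with v."""
        found: Set[Hashable] = set()
        for planes, buckets in self.tables:
            found.update(buckets.get(self._key(planes, v), ()))
        return found
```

Exact distances then need to be computed only for the returned candidates. Raising k shrinks the buckets, while raising l makes it less likely that a true neighbor is missed in every table.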

In Figure 9, we show LSH results on the 10 cm noise dataset with the 105 cm window
size using 160 RDs (exact nearest neighbor results are shown in Figure 6(b)). We
chose this data set because it was the most challenging of the noise tests where
spin images performed well (using an easier test such as the 5 cm noise experiment
provides a greater reduction in the number of comparisons). In calculating the RD
costs, the distance from a query point to a given model for which there were no hash
matches was set to a value larger than any of the other distances. In this way, we
penalized for a failure to match any hashed descriptors. To remove outliers caused
by unlucky hash divisions, we included in the sum in equation (5) only the 80
smallest distances between RDs and the returned reference descriptors. Note that
performing LSH using 3D shape contexts with k = 200 hash divisions and l = 10 hash
functions requires fewer descriptor comparisons than an exact nearest neighbor
search using spin images, and provides slightly better accuracy.

Fig. 9. Results for LSH experiments with 3D shape contexts on the 10 cm noise query
dataset using the 105 cm window size. Shown are results using 160 RDs where we
included the 80 smallest distances in the RD sum. The exact nearest neighbor results
for spin images and 3D shape contexts are shown for comparison. (Plot legend: spin
image, exact NN; 3D shape context, exact NN; LSH k=200, l=20; LSH k=200, l=10;
LSH k=250, l=20. Horizontal axis: number of features compared, log scale.)

Abstract. In this paper, we propose a new PDE-based methodology for
deformable surfaces that is capable of automatically evolving its shape 
to capture the geometric boundary of the data and simultaneously dis- 
cover its underlying topological structure. Our model can handle multiple 
types of data (such as volumetric data, 3D point clouds and 2D image 
data), using a common mathematical framework. The deformation be- 
havior of the model is governed by partial differential equations (e.g. 
the weighted minimal surface flow). Unlike the level-set approach, our 
model always has an explicit representation of geometry and topology. 
The regularity of the model and the stability of the numerical integra- 
tion process are ensured by a powerful Laplacian tangential smoothing 
operator. By allowing local adaptive refinement of the mesh, the model 
can accurately represent sharp features. We have applied our model for 
shape reconstruction from volumetric data, unorganized 3D point clouds 
and multiple view images. The versatility and robustness of our model 
allow its application to the challenging problem of multiple view recon- 
struction. Our approach is unique in its combination of simultaneous use 
of a high number of arbitrary camera views with an explicit mesh that 
is intuitive and easy-to-interact-with. Our model-based approach auto- 
matically selects the best views for reconstruction, allows for visibility 
checking and progressive refinement of the model as more images become 
available. The results of our extensive experiments on synthetic and real 
data demonstrate robustness, high reconstruction accuracy and visual 
quality. 



1 Introduction 

During the past decade, PDE-driven surface evolution has become very popular 
in the computer vision community for shape recovery and object detection. Most 
of the existing work is based on the Eulerian approach, i.e., the geometry and 
topology of the shape is implicitly defined as the level-set solution of time-varying
implicit functions over the entire 3D space [24], which can be computationally
very expensive. In this paper, we propose a new PDE-based deformable model
that, in contrast, takes the Lagrangian approach, i.e., the geometry and topology
of the deformable surface are always explicitly represented throughout the
simulation process. The elegance of our approach lies in the fact that we can use the
same PDE-based model for different types of data. The only thing that is
data-dependent is the control function, which describes the interaction with the data.
This is an important property that will allow easy application of our method- 
ology to other types of data, such as points, surfels, images and to incorporate 
other visual cues such as shading and optical flow. 

Starting with [17], deformable models have achieved great success in the 
areas of computer vision and pattern recognition. In general, deformable models 
can be divided into two categories: explicit models and implicit models. Explicit 
models include parametric representations [27] and discrete representations [28]. 
Implicit models [24,5,14] handle topology changes, based on the modeling of 
propagating fronts, which are the level set of some scalar function. The desirable 
shape must be explicitly evaluated using marching-cube-like techniques [23] in 
an additional post-processing stage. The narrow band algorithm [14] can reduce 
the computational cost related to the higher dimension. Recently, topologically 
adaptive explicit models have been proposed [26,22], reviewed in detail in [25]. 
The aforementioned deformable models were proposed mainly for the purpose of 
shape reconstruction from volumetric data and for medical image segmentation. 
For shape reconstruction from point clouds, existing work is mostly on static 
methods. They are either explicit methods [11,1,8], implicit methods [15] or 
based on radial basis functions [4,10]. 

Compared with level-set based methods, our new model is simpler, more in- 
tuitive, and makes it easier to incorporate user-control during the deformation 
process. To ensure the regularity of the model and the stability of the numerical 
integration process, powerful Laplacian tangential smoothing, along with com- 
monly used mesh optimization techniques, is employed throughout the geometric 
deformation and topological variation process. The new model can either grow 
from the inside or shrink from the outside, and it can automatically split to 
multiple objects whenever necessary during the deformation process. More im- 
portantly, our model supports level-of-details control through global subdivision 
and local/adaptive subdivision. Based on our experiments, the new model can 
generate a good, high-quality polygonal mesh that can capture underlying topo- 
logical structure simultaneously from various datasets such as volumetric data, 
3D unorganized point clouds and multiple view images. The explicit representa- 
tion of the model enables us to check for visibility and camera pose directly. 

Automatic reconstruction of 3D objects and environments from photographic
images is important for many applications that integrate virtual and real
data [29]. Many approaches have been used to solve the problem, e.g. by matching
features [34] or textures [12]. In traditional stereo methods, many partial
models must be computed with respect to a set of base viewpoints, and the surface
patches must be fused into a single consistent model [2] by Iterative Closest
Points (ICP) [30], but a parameterized model is still needed for final dense surface
reconstruction, and there is no explicit handling of occlusion. Meshes and/or
systems of particles [13] can be deformed according to constraints derived from 
images, but may end up clustering in areas of high curvature, and often fail with 
complicated topology. Recently, voxel-based approaches have been widely used 
to represent 3D shape [12,32,21,3,7] based on 3D scene space instead of image 
space. Marching-cube-like techniques [23] are still necessary to get the parame- 
terized surface. The space carving method [21] recovers a family of increasingly 
tighter supersets of the true scene. Once a voxel is carved away, it cannot be 
recovered, and any errors propagate into further erroneous carvings. This is par- 
tially alleviated by probabilistic space carving [8]. The level set method [12,9,16] 
is based upon variational analysis of the objects in the scene and their images 
while handling topology changes automatically. To overcome the complexity of 
implicit representation of objects in the level set approach, [18,19] operate on a 
surface represented by a depth function, at the cost of being limited to surface 
patches. A multi-resolution method using space carving and level set methods [33]
starts with coarse settings, which are refined when necessary.

However, these methods do not take advantage of the existence of a simple
explicit geometric representation. We show that high quality results can be
achieved with a simple, intuitive mesh that is easy to interact with, can
incorporate all the available image information, and allows for progressive
refinement as more images become available. With increased computer
performance, our reconstruction method will soon achieve interactive run times.
One can envision a user controlling the quality of the reconstruction during image
capture, being able to capture the most necessary remaining views to complete
the reconstruction [31]. Our PDE-based deformable model is described in Sec. 2
and applied to volumetric data in Sec. 3.1, unorganized point clouds in Sec. 3.2
and multi-view images in Sec. 3.3. Experimental results on synthetic and real
data are presented in Sec. 4.

2 PDE-Based Deformable Surface 

The deformation behavior of our new deformable surface is governed by an evo- 
lutionary system of nonlinear initial-value partial differential equations (PDE) 
with the following general form: 

\[
\frac{\partial S(p, t)}{\partial t} = F(t, k, k', f, \cdots)\, U(p, t) \qquad (1)
\]

where F is the speed function, t is the time parameter, k and k' are the surface
curvature and its derivative at the point p, and f is the external force.
S(p, 0) = S_0(p) is the initial surface. U is the unit direction vector and often it
represents the surface normal vector. Eq. 1 can be either directly provided by the
user, or more generally, obtained as a gradient descent flow by the Euler-Lagrange
equation of some underlying energy functionals based on the calculus of variations.

2.1 Model Refinement

Once an initial shape of the object is recovered, the model can be further refined 
several times to improve the fitting accuracy. In this paper, we have implemented 
two kinds of model refinement: global refinement and local/adaptive refinement. 
Global refinement is conducted by Loop’s subdivision scheme [6]. 

Adaptive refinement is guided by fitting accuracy as measured by the variance
of the distance from the triangle to the boundary of the object [36]. If the variance
of the distance samples for a given triangle is bigger than a user-defined threshold,
then this triangle will be refined. The variance of a discrete set of distances is
computed in the standard way: V_T[d] = E[d^2] - E[d]^2, where E denotes the mean
of its argument. To calculate the variance of the distance samples for a given
triangle, we temporarily quadrisect the triangle T into four smaller triangles 
and for each smaller triangle, calculate the distance at its barycentric center. At 
each level of adaptive refinement, all the triangles with fitting accuracy below 
the user-specified threshold will be quadrisected. The deformation of the model 
will resume only among those newly refined regions. In Fig. 4, we show different 
levels of refinements. 
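
The refinement test can be sketched as follows. The distance-to-boundary function is passed in as a callable because its definition depends on the data set; `quadrisect` splits at edge midpoints, and the names `distance_variance` and `needs_refinement` are assumptions for this example.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]
Triangle = Tuple[Point, Point, Point]


def midpoint(a: Point, b: Point) -> Point:
    return tuple((a[i] + b[i]) / 2.0 for i in range(3))


def centroid(t: Triangle) -> Point:
    return tuple((t[0][i] + t[1][i] + t[2][i]) / 3.0 for i in range(3))


def quadrisect(t: Triangle) -> List[Triangle]:
    """Split a triangle into four by connecting its edge midpoints."""
    a, b, c = t
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]


def distance_variance(t: Triangle, distance: Callable[[Point], float]) -> float:
    """E[d^2] - E[d]^2 over the distances at the four sub-triangle centers."""
    samples = [distance(centroid(s)) for s in quadrisect(t)]
    mean = sum(samples) / len(samples)
    mean_sq = sum(d * d for d in samples) / len(samples)
    return mean_sq - mean * mean


def needs_refinement(t: Triangle, distance: Callable[[Point], float],
                     threshold: float) -> bool:
    return distance_variance(t, distance) > threshold
```

A triangle lying at a constant distance from the boundary has zero variance and is left alone, whereas a triangle that straddles a sharp feature produces widely varying samples and is quadrisected.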

2.2 Mesh Regularity 

To ensure that the numerical simulation of the deformation process proceeds 
smoothly, we must maintain mesh regularity so that the mesh’s nodes have a 
good distribution, a proper node density, and a good aspect ratio of the triangles. 
This is achieved by the incorporation of a tangential Laplacian operator, and 
three mesh operations: edge split, edge collapse, and edge swap. 

The tangential Laplacian operator is used to maintain good node distribution.
The Laplacian operator, in its simplest form, repeatedly moves each mesh
vertex by a displacement equal to a positive scale factor times the average of the
vectors from that vertex to its neighboring vertices.
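
One smoothing pass might look like the sketch below. It uses the common umbrella form of the Laplacian (average of the neighbors minus the vertex) and makes it tangential by removing the component along the vertex normal; that projection, the function name, and the data layout are assumptions for illustration, since the operator is only described informally here.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def tangential_laplacian_step(vertices: Sequence[Point],
                              neighbors: Sequence[Sequence[int]],
                              normals: Sequence[Point],
                              lam: float) -> List[Point]:
    """Move each vertex by lam times the tangential part of its Laplacian."""
    moved = []
    for i, v in enumerate(vertices):
        nbrs = neighbors[i]
        if not nbrs:
            moved.append(tuple(v))                      # isolated vertex: unchanged
            continue
        avg = [sum(vertices[j][c] for j in nbrs) / len(nbrs) for c in range(3)]
        lap = [avg[c] - v[c] for c in range(3)]         # umbrella Laplacian vector
        n = normals[i]                                  # assumed to be unit length
        along = sum(lap[c] * n[c] for c in range(3))
        tang = [lap[c] - along * n[c] for c in range(3)]  # drop the normal component
        moved.append(tuple(v[c] + lam * tang[c] for c in range(3)))
    return moved
```

Because the normal component is removed, the step redistributes vertices within the surface without shrinking the shape along its normals.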

When edge lengths fall outside a predefined range, edge splitting and edge
collapsing are used to keep an appropriate node density. Edge swapping is used
to ensure a good aspect ratio of the triangles. This can be achieved by forcing
the average valence to be as close to 6 as possible [20]. An edge is swapped if
and only if the quantity $\sum_{p \in A} (\mathrm{valence}(p) - 6)^2$ is minimized
after the swapping.
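
The swap criterion can be illustrated with a small valence check. The sketch assumes that the set A consists of the four vertices touched by the swap: the two endpoints of the edge, which each lose one incident edge, and the two opposite vertices of the adjacent triangles, which each gain one. That reading of A and the function names are assumptions, not something the text spells out.

```python
from typing import Iterable


def valence_cost(valences: Iterable[int]) -> int:
    """Sum of squared deviations from the ideal valence of 6."""
    return sum((v - 6) ** 2 for v in valences)


def should_swap(va: int, vb: int, vc: int, vd: int) -> bool:
    """Edge (a, b) with opposite vertices c and d: swap if the cost decreases."""
    before = valence_cost((va, vb, vc, vd))
    after = valence_cost((va - 1, vb - 1, vc + 1, vd + 1))
    return after < before
```

For example, an edge joining two valence-7 vertices and facing two valence-5 vertices would be swapped, which brings all four vertices to valence 6.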

2.3 Topology Modification 

In order to recover a shape of arbitrary, unknown topology, the model must 
be able to modify its topology properly whenever necessary. In general, there 
are two kinds of topology operations: (1) Topology Merging, and (2) Topology 
Splitting. 

Topology Merging. We propose a novel method called “lazy merging” to handle
topology merging. The basic idea is that whenever two non-neighboring vertices
are too close to each other, they will be deactivated. Topology merging
will happen only after the deformation of the model stops and all the vertices
become non-active. There are three steps in the topology merging operation: (1)
Collision Detection, (2) Merging-Vertices Clustering, and (3) Multiple-Contours
Stitching.

Collision Detection: Collision detection is done hierarchically at two
different levels: coarser-level and finer-level. Coarser-level collision detection is
mainly for the purpose of collision exclusion. For each active vertex V, we
calculate its distance to all other non-neighboring active vertices. Vertices that
are far enough from the current vertex V that no collision can happen between them
are excluded; the remaining vertices are passed to the finer-level collision
detection. For each face f with three corner points (u, v, w) that is adjacent to
one of the vertices passed into the finer level of collision detection, we
calculate the distance between a number of sample points αu + βv + γw of the face f,
with barycentric coordinates α + β + γ = 1, and the current vertex V. If at least
one of these distances is smaller than the collision threshold, the two
corresponding vertices will be marked as merging vertices and will be deactivated.

Merging-Vertices Clustering: After all the merging vertices have been
deactivated, we need to divide them into several connected clusters. We randomly
pick any merging vertex and find all of its connected merging vertices by a
breadth-first search. We continue recursively, until all the merging vertices
belong to appropriate merging vertex clusters. Then for each cluster of merging
vertices, all the interior vertices will be removed and the remaining vertices will
be put into a linked list. This is based on the following observation: when two
regions are merging together, only the boundary regions will remain, and all the
interior regions will be burned out (i.e., removed).
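
The clustering step is a standard connected-components search restricted to the merging vertices. A minimal sketch, assuming vertices are integer ids and the mesh connectivity is given as an adjacency dictionary (both illustrative choices):

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def cluster_merging_vertices(merging: Iterable[int],
                             adjacency: Dict[int, Set[int]]) -> List[Set[int]]:
    """Group merging vertices into clusters connected through merging vertices."""
    unvisited = set(merging)
    clusters: List[Set[int]] = []
    while unvisited:
        seed = unvisited.pop()               # any remaining merging vertex
        cluster = {seed}
        queue = deque([seed])
        while queue:                         # breadth-first search from the seed
            v = queue.popleft()
            for w in adjacency.get(v, ()):
                if w in unvisited:           # only follow merging vertices
                    unvisited.remove(w)
                    cluster.add(w)
                    queue.append(w)
        clusters.append(cluster)
    return clusters
```

Removing the interior vertices of each cluster and ordering the remaining boundary vertices into a linked list would follow as a separate pass.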

Multiple-Contours Stitching: After the merging vertex linked lists have
been created, we stitch them together in three separate steps:

1. For each vertex in the linked lists, find its closest vertex in other linked lists.

2. Based on the proximity information obtained from the previous step, find a
pair of vertices A and B such that they are adjacent to each other in the
linked list L (Fig. 1(a)), their closest merging vertices A' and B' are
also adjacent to each other in the corresponding linked list L', and, in addition,
the closest merging vertices of A' and B' are A and B, respectively. Starting
from this pair of vertices A and B, iteratively go through the linked lists
and, if possible, connect each pair of adjacent vertices in one linked list to a
corresponding vertex in another linked list and create a new triangle.

3. If there are more than two linked lists to be stitched together, then after
stitching all the corresponding vertices, there may be some in-between gaps
that need to be filled in. For example, in Fig. 1(a), there is a gap between
the linked lists L, L' and L'' that consists of vertices B, C, C'', C' and B'.
We fill in the gap by creating a new vertex E at the center and connecting
the new vertex E with all the other vertices in the loop (Fig. 1(b)).

Topology Splitting. Topology splitting occurs when a part of the surface tends
to shrink to a single point. In this scenario, the surface has to split up into two
parts precisely at that location. We use a method similar to [22]. In particular,
a split operation is triggered if there exist three neighboring vertices which are
interconnected to each other, but the face formed by these three vertices does
not belong to the model (i.e., a virtual face), and if the length of any of the three
edges of the virtual face is smaller than the minimum edge length threshold and
thus needs to be collapsed. For example, in Fig. 1(c), face ABC represents a
virtual face that needs to be split. We divide the surface exactly at this location
by cutting it into two open sub-surfaces. Then we close the two open
sub-surfaces using two faces A_1B_1C_1 and A_2C_2B_2 whose orientations are opposite
to each other. Finally, we reorganize the neighborhood.

Fig. 1. (a)(b) Multiple contours stitching. (a) New triangles are created between
corresponding vertices, and a gap is created by vertices B, C, C'', C', B'. (b) The gap is
filled in by creating a new vertex E in the center and connecting it with all the other
vertices in the gap. (c) Topology split by splitting the virtual face ABC into two faces
whose orientations are opposite to each other.

3 Surface Reconstruction 

We have applied our PDE-based deformable surface to shape reconstruction from 
volumetric data, unorganized 3D point clouds and multi- view 2D images. The 
PDE we used is the general weighted minimal surface flow [5]: 

$$\frac{\partial S}{\partial t} = \big(g\,(v + H) - \nabla g \cdot N\big)\,N, \qquad S(0) = S_0 \tag{2}$$

where $S = S(t)$ is the 3D deformable surface, $t$ is the time parameter, and $S_0$ is 
the initial shape of the surface. $H$ is the mean curvature of the surface, 
$N$ is the unit normal of the surface, and $v$ is a constant velocity that enables 
the convex initial shape to capture non-convex, arbitrarily complicated shapes. It 
also helps to keep the model from getting stuck in local minima during 
the evolution process. $g$ is a monotonic non-increasing, non-negative function 
that is used for interaction of the model with the datasets, and it stops the 
deformation of the model when it reaches the boundary of the object. In essence, 
Eq. 2 controls how each point in the deformable surface should move in order 
to minimize the weighted surface area. Hence, the detected object is given by 
the steady-state solution of the equation, $S_t = 0$, i.e. when the velocity F is 
zero. In order to apply the model to different types of data, we simply need to 
provide the definition of g that is appropriate for the dataset. For example, in 2D 
multi-view based reconstruction, direct checking for visibility and camera pose 
can be easily incorporated into the appropriate control function. 

3.1 Volumetric Images 



For volumetric data sets, the stopping function g is defined as: 



$$g(S) = \frac{1}{1 + |\nabla (G_\sigma * I(S))|^2} \tag{3}$$



where $I$ is the volumetric density function, and $G_\sigma * I$ is the density 
function smoothed by convolution with a Gaussian filter with variance $\sigma$. 



3.2 Point Clouds 



For surface reconstruction from 3D unorganized point clouds, we use a simplified 
version of Eq. 2 with the term $\nabla g \cdot N$ removed. The stopping function $g$ is: 




$$g(p) = \begin{cases} \ldots & \text{if } D(p) < T_D \\ \ldots & \text{otherwise} \end{cases} \tag{4}$$



where $D(p)$ is the distance between the current position $p$ and its closest data points, and 
$T_D$ is the threshold distance, which is determined by the sampling rate of the point 
cloud dataset. In order to efficiently find the closest data points of any given 
position $p$, we preprocess the point cloud by putting the points into a uniform regular 
grid and connecting all the points inside one grid cell by a linked list. The 
distance threshold $T_D$ stops the movement of the model before it arrives at 
the "real" boundary of the object. To reduce the distance from the model to the 
boundary of the object, after the model stops its deformation, we project 
each vertex point to the local tangent plane of its $k$-nearest neighbors. 
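
The uniform-grid preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `build_grid` and `closest_distance`, the `ring` parameter, and the use of a Python dictionary of lists in place of per-cell linked lists are all assumptions made for the example.

```python
import math
from collections import defaultdict


def build_grid(points, cell):
    """Bucket 3D points by the integer index of the cubic cell that contains them."""
    grid = defaultdict(list)
    for p in points:
        grid[tuple(math.floor(c / cell) for c in p)].append(p)
    return grid


def closest_distance(grid, cell, p, ring=1):
    """Distance from p to the nearest stored point found within `ring` cells of p.

    Any data point closer than ring * cell is guaranteed to lie in the searched
    cells; math.inf is returned when those cells are empty.
    """
    kx, ky, kz = (math.floor(c / cell) for c in p)
    best = math.inf
    for dx in range(-ring, ring + 1):
        for dy in range(-ring, ring + 1):
            for dz in range(-ring, ring + 1):
                # grid.get avoids creating empty buckets in the defaultdict.
                for q in grid.get((kx + dx, ky + dy, kz + dz), ()):
                    best = min(best, math.dist(p, q))
    return best
```

Choosing the cell size close to the threshold distance keeps the search to the 27 cells around the query, which is enough to decide whether a vertex is within the threshold of the data.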

The local tangent plane can be estimated using principal component analysis 
(PCA) [15]: for any point $p$, its local tangent plane can be represented by a 
center point $c$ and a unit normal vector $n$. The center point $c$ is the centroid of 
the neighborhood of point $p$, which is denoted as $Nbhd(p)$. The normal vector 
$n$ is computed by eigenanalysis of the covariance matrix $C$ of $Nbhd(p)$, 
which is a symmetric $3 \times 3$ positive semi-definite matrix: 



$$C = \sum_{p_i \in Nbhd(p)} (p_i - c) \otimes (p_i - c) \tag{5}$$



Here, $\otimes$ denotes the outer product vector operator, and the normal vector 
$n$ is the eigenvector associated with the smallest eigenvalue of the covariance 
matrix $C$. In our experiments, $k$ is set to five. 
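
The tangent-plane estimate and the final vertex projection can be sketched as below. This is only an illustration under stated assumptions: instead of a full eigenanalysis it runs power iteration on the shifted matrix trace(C)·I − C, which is a substitute technique chosen to keep the example dependency-free, and the names `tangent_plane` and `project_to_plane` are hypothetical.

```python
import math


def tangent_plane(neighbors, iters=200):
    """Return (centroid c, unit normal n) of the plane fitted to 3D points."""
    k = len(neighbors)
    c = [sum(p[i] for p in neighbors) / k for i in range(3)]
    # Covariance C = sum_i (p_i - c)(p_i - c)^T, a symmetric 3x3 PSD matrix.
    C = [[sum((p[i] - c[i]) * (p[j] - c[j]) for p in neighbors)
          for j in range(3)] for i in range(3)]
    # B = trace(C) * I - C shares eigenvectors with C but reverses the order of
    # the eigenvalues, so power iteration on B converges to the eigenvector of
    # C with the smallest eigenvalue (assumes that eigenvalue is simple and the
    # start vector is not orthogonal to the normal).
    tr = C[0][0] + C[1][1] + C[2][2]
    B = [[(tr if i == j else 0.0) - C[i][j] for j in range(3)] for i in range(3)]
    n = [1 / math.sqrt(3)] * 3
    for _ in range(iters):
        w = [sum(B[i][j] * n[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        n = [x / norm for x in w]
    return c, n


def project_to_plane(p, c, n):
    """Orthogonally project p onto the plane through c with unit normal n."""
    d = sum((p[i] - c[i]) * n[i] for i in range(3))
    return [p[i] - d * n[i] for i in range(3)]
```

In practice a library eigen-solver for symmetric matrices would be the more robust choice; the sign of the returned normal is arbitrary, which does not affect the projection.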



3.3 2D Multiple Views 

The new model is capable of recovering shape not only from 3D data but also 
from 2D images. In the case of 2D photographic images, we use photo consistency 
as the weight function $g$ in Eq. 2, with a photo consistency criterion 
similar to the one in [21]. We only calculate the photo consistency w.r.t. each 
of the model vertices. For numerical robustness, the consistency calculation is 
performed by projecting a small patch around each vertex onto the image planes. 
It is assumed that the patch is reasonably small w.r.t. the distance between the 
object and the camera, while it is large enough to capture local features over 
images. A reasonable planar approximation for Lambertian surfaces is to take a 
patch on the tangent plane at the vertex. Again, we can use PCA to estimate 
the normal of the tangent plane, which is the first-order approximation 
of the current model surface. In order to find the correct correspondence of a 
point $i_1$ in correlation window A to a point $i_2$ in correlation window B (A and 
B are projections of surface patch P onto different images), we back-project $i_1$ 
to a point $p_1 \in P$ and then reproject $p_1$ to $i_2$ in correlation window B. Then we 
can calculate the photo consistency within patch projections in different views 
as discussed in [21]: 



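$$g = \sigma^2 = \sigma_r^2 + \sigma_g^2 + \sigma_b^2,$$

$$\sigma_r^2 = \frac{1}{N-1}\sum_{i=1}^{N} R_i^2 - \Big(\frac{1}{N}\sum_{i=1}^{N} R_i\Big)^2, \quad
\sigma_g^2 = \frac{1}{N-1}\sum_{i=1}^{N} G_i^2 - \Big(\frac{1}{N}\sum_{i=1}^{N} G_i\Big)^2, \quad
\sigma_b^2 = \frac{1}{N-1}\sum_{i=1}^{N} B_i^2 - \Big(\frac{1}{N}\sum_{i=1}^{N} B_i\Big)^2 \tag{6}$$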



where $N$ is the number of selected views. We only select the best $N$ views for 
our reconstruction, ignoring the most degenerate views. For a particular camera, 
using the OpenGL depth buffer, we first check visibility based on the current 
model. The visibility information can be calculated in every iteration with 
complexity $O(mn)$, where $m$ is the total number of cameras and $n$ is the number of 
surface vertices. 
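
As a rough illustration of the colour-variance idea behind Eq. 6, the sketch below sums the per-channel variance of one surface point's colour over the selected views. It uses the plain 1/N population variance, which is an assumption and may differ in normalisation from the expression in Eq. 6; the function name `photo_consistency` is hypothetical.

```python
def photo_consistency(colors):
    """Sum over R, G, B of the variance of one surface point's colour across views.

    colors: list of (r, g, b) tuples, one per selected view.
    A lower value means the views agree better on the colour.
    """
    n = len(colors)
    if n == 0:
        raise ValueError("need at least one view")
    total = 0.0
    for ch in range(3):
        mean = sum(c[ch] for c in colors) / n
        mean_sq = sum(c[ch] ** 2 for c in colors) / n
        # Population variance of this channel: E[x^2] - (E[x])^2.
        total += mean_sq - mean ** 2
    return total
```

A point seen with the same colour in every view scores zero, so a stopping function built from this value vanishes on photo-consistent surface points.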

Among the visible points, we take those with the largest projection area to 
the image plane. The projection area gives both distance and pose information. 



SO coso Z 2 
If = cos 0 V 



( 7 ) 



Using the geometric properties of solid angles [35], we derive our camera weight 
Wij from Eq. 7 

Wi,j = (cosaV (j>) 2 > = 0 when not visible (8) 

In Fig. 2, we show how to evaluate camera position and orientation. 

Using Eq. 2, we can set the speed function for each vertex. We can start 
from a coarse mesh (like a cube), and subdivide it after it is close to the surface 




Fig. 2. Selecting the best-N views by camera positions 

of the object. More specifically, we can use adaptive mesh refinement to locally 
subdivide areas with large curvature, where most details exist. The regularization 
term helps to keep the surface smooth and to make it converge to the final result. 
Since we have the explicit representation of the object surface, we can always 
incorporate new images taken from new camera positions, and incrementally 
enhance the quality of the final reconstruction. The complete algorithm is: 

1. Preprocessing: For each camera position check visibility using the depth 
buffer value of each vertex. 

2. For each vertex point on the model: 

a) For each camera, check camera visibility and pose. 

b) Select the best N-views. 

c) Calculate the photo consistency g at the vertex position (first order patch 
approximation) . 

d) Calculate the gradient $\nabla g$ at the vertex position. 

e) Get the vertex speed from the PDE of Eq. 2, and the new vertex position 
in the next iteration. 

3. Laplacian regularization. 

4. Mesh adaptive refinement if necessary. 
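
Steps 2(e) and 3 of the algorithm can be sketched as follows. This is a hedged illustration only: the explicit Euler time step, the simple umbrella-operator form of the Laplacian regularization, and the names `evolve_vertex`, `laplacian_smooth`, `dt` and `lam` are assumptions, not details given in the text.

```python
def evolve_vertex(p, normal, g, grad_g, H, v, dt):
    """One explicit Euler step of dS/dt = (g (v + H) - grad_g . N) N at a vertex.

    p, normal, grad_g are 3-tuples; g, H, v, dt are scalars.
    """
    speed = g * (v + H) - sum(a * b for a, b in zip(grad_g, normal))
    return tuple(x + dt * speed * n for x, n in zip(p, normal))


def laplacian_smooth(vertices, neighbors, lam=0.5):
    """Move each vertex a fraction lam toward the average of its mesh neighbors.

    neighbors[i] is a list of vertex indices adjacent to vertex i.
    """
    out = []
    for i, p in enumerate(vertices):
        nb = neighbors[i]
        if not nb:
            out.append(tuple(p))
            continue
        avg = [sum(vertices[j][k] for j in nb) / len(nb) for k in range(3)]
        out.append(tuple(p[k] + lam * (avg[k] - p[k]) for k in range(3)))
    return out
```

With g equal to zero and a vanishing gradient the vertex does not move, which matches the role of g as a stopping function.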

4 Experimental Results 

Results from 3D Data: In this section, we will show experimental results 
on both real and synthetic datasets. In all the following figures, grey regions 
represent parts of the model that are still active and deforming, black regions 
represent deactivated parts of the model that have already reached the boundary 
of the object. In order to illustrate the good mesh quality generated by our new 
model, we will show at least one wireframe view of the model in all the following 
figures. The input volumetric dataset of Fig. 3 is obtained from CT scanning of 
a phantom of the vertebra. The data size is 128 x 120 x 52 voxels. Fig. 4 and 
Fig. 5 illustrate the surface reconstruction process from 3D unorganized point 
clouds. The input dataset of Fig. 4 is the mannequin head with 7440 data points. 
The original dataset has a big hole in the bottom of the head. We have manually 
filled in the hole since our model currently can only handle closed shapes. The 
input dataset of Fig. 5 is obtained by sampling a subdivision surface with 6140 
data points. 

Results from 2D Multiple Views: Fig. 6(b) is the reconstruction from a 
synthesized torus, demonstrating topology changes while recovering shape. The 
original dimensions of the synthesized torus are: 206 x 205 x 57. Compared to 
ground truth, we get min error of 0.166, max error of 14.231, mean error of 2.544 
and RMS of 1.087. Fig. 6(d) is the reconstruction of a mug from 16 real images 
that proves the ability of recovering the underlying topology from 2-D images. 
The use of local adaptive refinement on the top allows the model to get the 
detail of the small grip on the lid of the mug. The multi-resolution procedure 
has two levels of detail in Fig. 7. Fig. 7(b) is the result after adaptive refinement. 

Fig. 3. Segmentation of a CT volumetric dataset of a human vertebra. (a) Initial model 
and input data; (b)-(e) model evolving; (f) mesh model; (g) optimized mesh model; (h) 
shaded model. 




Fig. 4. Reconstruction from point-clouds data of mannequin head, (a) model initial- 
ization; (b) and (c) during deformation; (d)(e) final shape of the model; (f)(g) are two 
different levels of adaptive refinement; (h) is shaded result 



It increases the resolution while avoiding global subdivision. Fig. 8(a) shows the 
positions of 16 real images of a Buddha figure. The 2 images with question marks 
(one of which is shown in 8(b)) were not used in the reconstruction. Fig. 8(c) is 
the final texture mapped mesh rendered from a similar viewpoint as the image 
in 8(b). Fig. 9 shows an example of incremental reconstruction. Fig. 9(a) is the 

Fig. 5. (a) model initialization; (b)(c) during deformation, dark areas are inactive, 
while grey areas are active; (d) final shape; (e) one level of refinement 




(a) (b) (c) (d) 

Fig. 6. Fitting to synthesized images with topology changes (a) one of 24 synthetic 
images of torus; (b) final mesh with 897 vertices after fitting to images (mean error of 
2.544); (c) one of 16 real image of a mug; (d) final textured mesh after fitting to the 
images, the fine detail at the top is the result of local adaptive refinement. 




(a) (b) (c) 



Fig. 7. Adaptive reconstruction, (a) initial reconstruction; (b) reconstruction after one 
level of adaptive refinement, the two handles of the cup are correctly recovered, (c) 
textured results from (b) 



reconstruction result from 6 frontal images, and 9(b) is its mesh representation. 
The back of the model has not deformed due to the lack of image data. 9(c) is 
one of the 5 images added later. After adding images, the model further deforms 
in 9(d), and finally captures the complete shape shown in 9(e) and 9(f). 

Table 1 gives the information of the recovered shape, including the number 
of vertices, edges and faces for each model, and the running time. The running 
time is measured on an Intel Pentium 4M 1.6 GHz notebook PC with 384 MB 
internal memory. 

Fig. 8. (a) Positions of 16 real images of Buddha; the 2 images with question marks 
were not used in the reconstruction, and one of them is shown in (b); (c) final 
texture-mapped mesh rendered from a similar viewpoint as image (b). 




(a) (b) (c) (d) (e) (f) 



Fig. 9. Incremental reconstruction, (a): partial reconstruction result from 6 frontal 
images (Black area is not recovered due to lack of image of the data for that part 
of the model ); (b): untextured fitting mesh of (a); (c): one of the 5 images (back 
views) added later; (d): intermediate result after adding more images; (e) complete 
reconstruction result; (f) mesh result of (e). 

Table 1. Recovered shape information 

Figure   Vertices   Faces   Edges   Time (sec)
4(e)          466     928    1392           23
4(f)         3501    6998   10497           19
4(g)         8386   16768   25152          119
9(f)         1734    3464    5196          191
5(d)          723    4774    7161           18
5(e)         2895    8688    5792            9
7(a)         1513    3018    4527          223
7(b)         2770    5540    8310          584



5 Discussion 

In this paper, we proposed a new PDE-based deformable surface that is ca- 
pable of automatically evolving its shape to capture geometric boundaries and 
simultaneously discover their underlying topological structure. The deformation 
behavior of the model is governed by partial differential equations, that are de- 
rived by the principle of variational analysis. The model ensures regularity and 
stability, and it can accurately represent sharp features. We have applied our 
model to shape reconstruction from volumetric data, unorganized point clouds 
and multi-view 2D images. The characteristics of the model make it especially 
useful in recovering 3D shape out of 2D multi-view images, handling visibility, 
occlusion and topology changes explicitly. The existence of a mesh representation 
allows progressive refinement of the model when appropriate. Our mathematical 
formulation allows us to use the same model for different types of data, simply 
by using the appropriate data interface function. We plan to further exploit this 
property in future work to apply the model to heterogeneous data such as points, 
surfels, images and to incorporate other visual cues such as shading and optical 
flow. 

Abstract. Vision (both using one-dimensional and two-dimensional 
retina) is useful for the autonomous navigation of vehicles. In this pa- 
per the case of a vehicle equipped with multiple cameras with non- 
overlapping views is considered. The geometry and algebra of such a 
moving platform of cameras are considered. In particular we formulate 
and solve structure and motion problems for a few novel cases of such 
moving platforms. For the case of two-dimensional retina cameras (ordi- 
nary cameras) there are two minimal cases of three points in two plat- 
form positions and two points in three platform positions. For the case 
of one-dimensional retina cameras there are three minimal structure and 
motion problems. In this paper we consider one of these (6 points in 3 
platform positions). The theory has been tested on synthetic data. 



1 Introduction 

Vision (both using one-dimensional and two-dimensional retina) is useful for the 
autonomous navigation of vehicles. An interesting case arises when the vehicle is 
equipped with multiple cameras with different focal points pointing in different 
directions. 

Our personal motivation for this work stems from autonomously 
guided vehicles (AGVs), which are important components for factory 
automation. Such vehicles have traditionally been guided by wires buried in the 
factory floor. This gives a very rigid system. The removal and change of wires 
are cumbersome and costly. The system can be drastically simplified using navi- 
gation methods based on laser or vision sensors and computer vision algorithms. 
With such a system the position of the vehicle can be computed instantly. The 
vehicle can then be guided along any feasible path in the room. 

Note that the discussion here is focused on finding initial estimates of struc- 
ture and motion. In practice it is necessary to refine these estimates using non- 
linear optimisation or bundle adjustment, cf. [Sla80,Ast96]. 

Structure and motion recovery from a sequence of images is a classical prob- 
lem within computer vision. A good overview of the techniques available for 
structure and motion recovery can be found in [HZ00]. Much is known about 
minimal cases, feature detection, tracking and structure and motion recovery 
has been built, [Nis01,PKVG98]. Such systems are however difficult to build. 
There are ambiguous configurations [KHA01] for which structure and motion 
recovery is impossible. Many automatic systems rely on small image motions in 
order to solve the correspondence problem. In combination with the small field 
of view of most cameras, this limits the way the camera can be moved in order to 
make a good 3D reconstruction. The problem is significantly more stable with a 
large field of view [OA98]. Recently there have been attempts at using rigs with 
several simple cameras in order to overcome this problem, cf. [Ple03,BFA01]. 
In this paper we study and solve some of the minimal cases for multi-camera 
platforms. Such solutions are necessary components for automatic structure and 
motion recovery systems. 

The paper is organised as follows. Section 2 presents the common theory and 
the problem formulation, and the following sections study three 
minimal cases. 

2 The Geometry of Vision from a Moving Platform 

In this paper we consider a platform (a vehicle, robot, car) moving with a planar 
motion. This vehicle has a number of cameras with different camera centres 
facing outwards so that they cover different viewing angles. The purpose here 
is to get a large combined field of view using simple and cheap cameras. The 
cameras are assumed to have different camera centres but it is assumed that 
the cameras are calibrated relative to the vehicle, i.e. it is assumed that the 
camera matrices are known for all cameras on the vehicle. It is known that 
a very large field of view improves stability [BFA01] and Pless derives the basic 
equations to deal with multiple cameras [Ple03]. 

As both 1D and 2D-retina cameras are studied, the equations of both will be 
introduced in parallel, 1D cameras on the left-hand side and 2D on the right-hand 
side of the paper. 

Both types can be modeled as $\lambda u = PU$, where the camera matrix $P$ is a $2 \times 3$ 
or $3 \times 4$ matrix. A scene point $U$ is in $P^2$ or $P^3$ and a measured image point 
$u$ is in $P^1$ or $P^2$. It is sometimes useful to consider dual image coordinates. In 
the one-dimensional retina case each image point $u$ is dual to a vector $v$, with 
$vu = 0$, and in the 2D case there are two linearly independent dual vectors $v^1$ 
and $v^2$ with $v^i u = 0$. The measurement equation then becomes 

$$vPU = 0 \quad\text{or}\quad v^1 PU = v^2 PU = 0.$$

Fig. 1. Three calibrated cameras with constant and known relative positions taking 
two images each 



As the platform moves the cameras move together. This is modeled as a 
transformation $S_i$ between the first position and position $i$. In the original coordinate 
system the camera matrix for camera $j$ at position $i$ is $P_j S_i$. 

It is assumed here that the camera views do not necessarily have common 
points. In other words, a point is typically seen by only one camera. On the other 
hand it can be assumed that in a couple of neighboring frames a point can be 
seen in the same camera. Assume here that point j is visible in camera j. The 
measurement equation for the n points is then 



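$$\lambda_{ij} u_{ij} = P_j S_i U_j, \qquad j = 1, \ldots, n, \; i = 1, \ldots, m.$$

Using the dual image coordinates we obtain 

$$v_{ij} P_j S_i U_j = 0 \quad\text{or}\quad \begin{cases} v^1_{ij} P_j S_i U_j = 0, \\ v^2_{ij} P_j S_i U_j = 0, \end{cases} \qquad j = 1, \ldots, n, \; i = 1, \ldots, m,$$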



for the one respectively two-dimensional retina case. 

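Note that $l_{ij} = v_{ij} P_j$ and $l^k_{ij} = v^k_{ij} P_j$ ($k = 1, 2$) correspond to the viewing line 
or viewing plane in the vehicle coordinate system. Thus the constraint can be 
written 

$$l_{ij} S_i U_j = 0 \quad\text{or}\quad \begin{cases} l^1_{ij} S_i U_j = 0, \\ l^2_{ij} S_i U_j = 0, \end{cases} \qquad j = 1, \ldots, n, \; i = 1, \ldots, m. \tag{1}$$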



Here the lines or planes $l$ are measured. The question is if one can calculate 
structure $U_j$ and motion $S_i$ from these measurements. Based on the previous 
sections, the structure and motion problem will now be defined. 



Problem 1. Given the $mn$ images $u_{ij}$ of $n$ points from $m$ different platform 
positions and the camera matrices $P_j$, the surveying problem is to find 
reconstructed points $U_j$ and platform transformations $S_i$ such that 

$$\lambda_{ij} u_{ij} = P_j S_i U_j, \qquad \forall i = 1, \ldots, m, \; j = 1, \ldots, n$$

for some $\lambda_{ij}$. 



2.1 Minimal Cases 

In order to understand how much information is needed in order to solve the 
structure and motion problem, it is useful to calculate the number of degrees of 
freedom of the problem and the number of constraints given by the projection 
equation. Each object point has two degrees of freedom in the two-dimensional 
world and three in the three-dimensional world. Vehicle location for a planarly 
moving vehicle has three degrees of freedom when using $a^2 + b^2 = 1$, that is 
Euclidean information, and four degrees of freedom in the similarity case when 
not using this information. The word "image" is used to mean all the information 
collected by all our cameras at one instant in time. 

For Euclidean reconstruction in three dimensions there are $2mn - (3n + 3(m - 1))$ 
excess constraints and, as seen in Table 1, there are two interesting cases: 

1. two images and three points (m=2, n=3). 

2. three images and two points (m=3, n=2). 

For the similarity case in two dimensions there are $mn - (2n + 4(m - 1))$ 
excess constraints and, as seen in Table 1, there are three interesting cases: 

1. three images of eight points (m=3, n=8). 

2. four images of six points (m=4, n=6). 

3. six images of five points (m=6, n=5). 

For the Euclidean case in two dimensions there are $mn - (2n + 3(m - 1))$ 
excess constraints and, as seen in Table 1, there are three interesting cases: 

1. three images of six points (m=3, n=6). 

2. four images of five points (m=4, n=5), which is overdetermined. 

3. five images of four points (m=5, n=4). 

All these will be called the minimal cases of the structure and motion 
problem. 
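
The counting argument can be checked mechanically. The sketch below evaluates the three excess-constraint formulas from the text and lists the (m, n) pairs with exactly zero excess; the function names, the case labels and the search bounds are illustrative assumptions.

```python
def excess(m, n, case):
    """Excess constraints for m images (platform positions) of n points."""
    if case == "3d_euclidean":
        return 2 * m * n - (3 * n + 3 * (m - 1))
    if case == "2d_similarity":
        return m * n - (2 * n + 4 * (m - 1))
    if case == "2d_euclidean":
        return m * n - (2 * n + 3 * (m - 1))
    raise ValueError(case)


def balanced_cases(case, max_m=7, max_n=9):
    """All (m, n) within the bounds where constraints exactly match unknowns."""
    return [(m, n)
            for m in range(1, max_m + 1)
            for n in range(1, max_n + 1)
            if excess(m, n, case) == 0]
```

For the 2D Euclidean case the pair (m=4, n=5) has an excess of one, which is why the text marks it as overdetermined rather than exactly minimal.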



Table 1. The number of excess constraints for n points (rows) and m images (columns) 

2D Similarity
n\m     1    2    3    4    5    6    7
1      -1   -4   -7  -10  -13  -16  -19
2      -2   -4   -6   -8  -10  -12  -14
3      -3   -4   -5   -6   -7   -8   -9
4      -4   -4   -4   -4   -4   -4   -4
5      -5   -4   -3   -2   -1    0    1
6      -6   -4   -2    0    2    4    6
7      -7   -4   -1    2    5    8   11
8      -8   -4    0    4    8   12   16
9      -9   -4    1    6   11   16   21

2D Euclidean
n\m     1    2    3    4    5    6    7
1      -1   -3   -5   -7   -9  -11  -13
2      -2   -3   -4   -5   -6   -7   -8
3      -3   -3   -3   -3   -3   -3   -3
4      -4   -3   -2   -1    0    1    2
5      -5   -3   -1    1    3    5    7
6      -6   -3    0    3    6    9   12
7      -7   -3    1    5    9   13   17
8      -8   -3    2    7   12   17   22
9      -9   -3    3    9   15   21   27

3D Euclidean
n\m     1    2    3    4    5    6    7
1      -1   -2   -3   -4   -5   -6   -7
2      -2   -1    0    1    2    3    4
3      -3    0    3    6    9   12   15
4      -4    1    6   11   16   21   26
5      -5    2    9   16   23   30   37
6      -6    3   12   21   30   39   48
7      -7    4   15   26   37   48   59
8      -8    5   18   31   44   57   70
9      -9    6   21   36   51   66   81







H. Stewenius and K. Astrom 



3 Two-Dimensional Retina, Two Positions, and Three 
Points 

Theorem 1. For three calibrated cameras that are rigidly fixed with a known 
transformation relative to each other, each taking an image of a point at two 
distinct times there generally exist one or three real non-trivial solutions. 

Pure translation is a degenerate case and can only be computed up to scale. 

Proof. See solution procedure. 

Departing from equation (1) for two observations of point $j$ gives 

$$\underbrace{\begin{pmatrix} l^1_{0j} S_0 \\ l^2_{0j} S_0 \\ l^1_{1j} S_1 \\ l^2_{1j} S_1 \end{pmatrix}}_{M_j} U_j = 0.$$

As $U_j \neq 0$ there is a nonzero solution to the above homogeneous linear system, 
i.e. 

$$\det M_j = 0. \tag{2}$$



Note that the constraint above is in essence the same as in [Ple03]. The 
same constraint can be formulated as $L_1^T F L_2 = 0$, where $L_1$ and $L_2$ are Plücker 
coordinate vectors for the space lines and $F$ is a $6 \times 6$ matrix representing the relative 
motion of the two platforms. 

By a suitable choice of coordinate system it can be assumed that $S_0 = I$, 
that is, $a_0 = b_0 = c_0 = d_0 = 0$, and to reduce the number of indices we set 
$a = a_1$, $b = b_1$, $c = c_1$ and $d = d_1$. The planes defined by $l^1_{0j}$, $l^2_{0j}$, $l^1_{1j}$ and $l^2_{1j}$ 
all come from camera $j$ and pass through this camera centre in the vehicle 
coordinate system, that is, through a common point, implying 

$$\det \begin{pmatrix} l^1_{0j} \\ l^2_{0j} \\ l^1_{1j} \\ l^2_{1j} \end{pmatrix} = 0.$$



Computing the determinant in equation (2) gives 

$$\alpha_{aj} a + \alpha_{bj} b + \alpha_{cj} c + \alpha_{dj} d + \alpha_{Aj}(da + cb) + \alpha_{Bj}(db - ca) = 0,$$

where the coefficients $\alpha_j$ are functions of the measured planes $l^1_{0j}$, $l^2_{0j}$, $l^1_{1j}$, $l^2_{1j}$. 
The assumption of rigid movement is equivalent 
to $(1 + a)^2 + b^2 - 1 = 0$. With three points observed, one in each of the three 
cameras, the above gives four polynomials in $(a, b, c, d)$. After homogenisation 
with $t$ the polynomial equations are 

$$\begin{cases}
f_1 = \alpha_{a1} at + \alpha_{b1} bt + \alpha_{c1} ct + \alpha_{d1} dt + \alpha_{A1}(ad + bc) + \alpha_{B1}(ac - bd) = 0 \\
f_2 = \alpha_{a2} at + \alpha_{b2} bt + \alpha_{c2} ct + \alpha_{d2} dt + \alpha_{A2}(ad + bc) + \alpha_{B2}(ac - bd) = 0 \\
f_3 = \alpha_{a3} at + \alpha_{b3} bt + \alpha_{c3} ct + \alpha_{d3} dt + \alpha_{A3}(ad + bc) + \alpha_{B3}(ac - bd) = 0 \\
f_4 = a^2 + 2at + b^2 = 0.
\end{cases} \tag{3}$$

Lemma 1. $(a, b, c, d, t) = (0, 0, 0, 0, 1)$ and $(a, b, c, d, t) = \lambda(1, \pm i, 0, 0, 0)$ are 
solutions to equation (3). 

Proof. This follows by inserting the solutions in the equations. 



3.1 Solving 

Equation (3) is solved by using different combinations of the original equations 
to construct a polynomial in b which gives roots that can be used to solve for 
the other variables. 

As equation (3) is homogeneous it can be de-homogenised by studying two cases, 
$a = 0$ and $a = 1$. 

$a = 0$ gives $b = 0$ (by $f_4 = 0$). Now remains 

$$\begin{cases}
t(\alpha_{c1} c + \alpha_{d1} d) = 0 \\
t(\alpha_{c2} c + \alpha_{d2} d) = 0 \\
t(\alpha_{c3} c + \alpha_{d3} d) = 0,
\end{cases}$$

which in general has only the trivial solutions $(c, d, t) = (0, 0, t)$ and $(c, d, t) = 
(c, d, 0)$. If the 3 equations are linearly dependent more solutions exist of the form 
$(c, d, t) = (k_c s, k_d s, t)$. This means that pure translation can only be computed 
up to scale. 

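$a = 1$ gives 

$$\begin{cases}
f_i = \alpha_{ai} t + \alpha_{bi} bt + (\alpha_{ci} + \alpha_{Bi}) ct + (\alpha_{di} + \alpha_{Ai}) dt + \alpha_{Ai}(d + bc) + \alpha_{Bi}(c - bd) = 0, \quad i = 1, 2, 3 \\
f_4 = 1 + 2t + b^2 = 0.
\end{cases}$$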

The 24 coefficients of the 17 polynomials 

$$f_i, \; f_i b, \; f_i b^2, \quad i = 1, 2, 3,$$

$$f_4 db, \; f_4 d, \; f_4 cb, \; f_4 c, \; f_4 b^3, \; f_4 b^2, \; f_4 b, \; f_4$$

are ordered in the lex order $t < d < c < b$ into a $17 \times 24$ matrix $M$ with one 
polynomial per row. As no solutions are lost by multiplying by a polynomial and 
the original polynomials are in the set, this implies 

$$MX = 0, \tag{4}$$

where $X = [X_1^T \; X_2^T]^T$ and 

$$X_1 = [tdb^2, tdb, td, tcb^2, tcb, tc, tb^3, tb^2, tb, t, db^3, db^2, db, d, cb^3, cb^2, cb, c]^T,$$

$$X_2 = [b^5, b^4, b^3, b^2, b, 1]^T.$$



Dividing $M = [M_1 \; M_2]$ where $M_1$ is $17 \times 18$ and $M_2$ is $17 \times 6$, equation (4) 
can be written $M_1 X_1 + M_2 X_2 = 0$. Unfortunately these pages are too few to 
show the resulting matrices but it is easily proven that $\operatorname{rank} M_1 \le 16$; the easiest 
way to do this is to use the fact that $\operatorname{rank} M_1 = \operatorname{rank} M_1^T$, and it is easy to find 
two vectors in the null-space of $M_1$. Therefore there exists $v$ such that $vM_1 = 0$ and 
$0 = vMX = vM_1X_1 + vM_2X_2 = vM_2X_2$, that is, a fifth order polynomial 
$p_1(b) = vM_2X_2 = 0$. Knowing that $b = \pm i$ are solutions to this equation, a 
new polynomial $p_2(b) = p_1(b)/(1 + b^2)$ can be calculated and then solved for the 
remaining roots. 

It is then possible to use $f_4$ to solve for $t$. Knowing $t$ it is possible to change 
back to the original non-homogeneous variables $(a, b, c, d)$ and solve for $c$ and $d$. 
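
The deflation step can be illustrated with a short sketch: dividing a quintic by $1 + b^2$ removes the known roots $b = \pm i$ and leaves a cubic. The coefficient ordering (highest degree first) and the function names are assumptions; `t_from_b` assumes the reading $f_4 = 1 + 2t + b^2$ for the case $a = 1$.

```python
def deflate_by_b2_plus_1(coeffs):
    """Divide p(b) by b^2 + 1.

    coeffs: coefficients of p, highest degree first.
    Returns (quotient coefficients, [remainder_b, remainder_const]).
    """
    work = list(coeffs)
    quotient = []
    for k in range(len(work) - 2):
        lead = work[k]
        quotient.append(lead)
        # Subtract lead * (b^2 + 0*b + 1) aligned at position k.
        work[k + 2] -= lead
    return quotient, work[-2:]


def t_from_b(b):
    """Solve 1 + 2t + b^2 = 0 for t, given a root b."""
    return -(1 + b * b) / 2
```

A remainder that is not numerically close to zero would indicate that $\pm i$ are not roots of the input polynomial, which is a cheap sanity check on the computed null-space vector.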



3.2 Implementation and Code 

The implementation is quite straightforward and the code is available for 
download [Ste]. As $f_4$ has fixed coefficients and the result is divided by $(1 + b^2)$, it is 
possible to reduce $M$ to a $9 \times 12$ matrix with $M_1$ of size $9 \times 10$, or $9 \times 8$ if the 
missing rank is used as well. $M_2$ is of size $9 \times 6$ but can be simplified to $9 \times 4$ 
as the roots $b = \pm i$ are known. Reducing the size of $M$ is useful as most of the 
computations come from finding the null-space of $M_1$. 



4 Two-Dimensional Retina, Three Positions, and Two 
Points 



Theorem 2. Two cameras mounted rigidly with a known transformation with 
respect to each other for which calibration as well as relative positions are known 
are moved planarily to 3 different stations where they observe one point per 
camera. Under these circumstances there generally exist one or three non-trivial 
real solutions. 

Pure translation is a degenerate case and can only be computed up to scale. 
Proof. The existence of a solver giving that number of solutions. 



A point $j$ observed in 3 images of the same camera gives 

$$\underbrace{\begin{pmatrix} l^1_{0j} S_0 \\ l^2_{0j} S_0 \\ l^1_{1j} S_1 \\ l^2_{1j} S_1 \\ l^1_{2j} S_2 \\ l^2_{2j} S_2 \end{pmatrix}}_{M_j} U_j = 0.$$

As a non-zero solution exists, $\operatorname{rank} M_j \le 3$. This is equivalent to the fact that 
$\det(M_{sub}) = 0$ for all $4 \times 4$ sub-matrices of $M_j$. With a suitable choice of 
coordinate system $S_0 = I$, that is, $a_0 = b_0 = c_0 = d_0 = 0$, the 15 sub-matrices of 
$M_j$ can be computed, and this gives 15 polynomial equations of the second and 
third degree in $(a_1, b_1, c_1, d_1, a_2, b_2, c_2, d_2)$. Rigid planar motion implies 

$$a_i^2 + 2a_i + b_i^2 = 0, \qquad i = 1, 2.$$

Observing two points through different but relatively fixed cameras at 
3 instants gives a total of 32 equations in 8 unknowns. 



4.1 Solving 

A solver inspired by the previous case has been built but it is still very slow. 
Nevertheless, it generates a fifth order polynomial in $b_1$ for which we know the 
solutions $b_1 = \pm i$. 

If $a_1 = a_2 = 0$ then $b_1 = b_2 = 0$ and the system is degenerate and the solution 
for $(c_1, d_1, c_2, d_2)$ can only be computed up to scale. 



4.2 Implementation and Code 

The solver is still in an early stage and very slow. The matrix M will in this case 
be much larger than in the previous case but the simplifications used there will 
be possible here as well. As soon as a decent implementation is ready it will be 
available for download. 

5 One-Dimensional Retina, Three Positions, and 
Six/Eight Points 

5.1 Intersection and the Discrete Trilinear Constraint 

In this section we will try to use the same technique for solving the structure and 
motion problem as in [AO00]. The idea is to study the equations for a particular 
point for three views. The fact that three planar lines intersect in a point gives 
a constraint that we are going to study. 

The case of three cameras is of particular importance. Using three measured 
bearings from three different known locations, the object point is found by inter- 
secting three lines. This is only possible if the three lines actually do intersect. 
This gives an additional constraint, which can be formulated in the following 
way 

Theorem 3. Let $l_{1j}$, $l_{2j}$ and $l_{3j}$ be the bearing directions to the same object 
point from three different camera states. Then the trilinear constraint 

$$\sum_{p,q,r} T^{pqr} \, l_{1j,p} \, l_{2j,q} \, l_{3j,r} = 0 \tag{5}$$

is fulfilled for some $3 \times 3 \times 3$ tensor $T$. 

Proof. By lining up the line equations 

$$\underbrace{\begin{pmatrix} l_{1j} S_1 \\ l_{2j} S_2 \\ l_{3j} S_3 \end{pmatrix}}_{M} U_j = 0$$

we see that the $3 \times 3$ matrix $M$ has a non-trivial right null-space. Therefore its 
determinant is zero. Since the determinant is linear in each row it follows that 
it can be written as 

$$\det M = \sum_{p,q,r} T^{pqr} \, l_{1j,p} \, l_{2j,q} \, l_{3j,r} = 0,$$

for some $3 \times 3 \times 3$ tensor $T$. Here $l_{1j,p}$ denotes element $p$ of vector $l_{1j}$, and similarly 
for $l_{2j,q}$ and $l_{3j,r}$. 


The tensor $T = T_{pqr}$ in (5) will now be analysed in more detail. The mapping
from the motion parameters $(S_1, S_2, S_3)$ to the tensor $T$ is invariant to changes
of the coordinate system, i.e. by multiplying each of the transformation matrices
with the same matrix. Thus without loss of generality one may assume that
$S_1 = I$. Introducing parameterisations according to (1) the tensor components
are

$$\begin{array}{lll}
T_{111} = b_2 c_3 - b_3 c_2, & T_{112} = b_2 d_3 - a_3 c_2, & T_{113} = b_2, \\
T_{121} = a_2 c_3 - b_3 d_2, & T_{122} = a_2 d_3 - a_3 d_2, & T_{123} = a_2, \\
T_{131} = -b_3, & T_{132} = -a_3, & T_{133} = 0, \\
T_{211} = -a_2 c_3 + a_3 c_2, & T_{212} = -a_2 d_3 - b_3 c_2, & T_{213} = -a_2, \\
T_{221} = b_2 c_3 + a_3 d_2, & T_{222} = b_2 d_3 - b_3 d_2, & T_{223} = b_2, \\
T_{231} = a_3, & T_{232} = -b_3, & T_{233} = 0, \\
T_{311} = a_2 b_3 - a_3 b_2, & T_{312} = a_2 a_3 + b_3 b_2, & T_{313} = 0, \\
T_{321} = -b_3 b_2 - a_2 a_3, & T_{322} = a_2 b_3 - a_3 b_2, & T_{323} = 0, \\
T_{331} = 0, & T_{332} = 0, & T_{333} = 0.
\end{array} \qquad (6)$$
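As a numerical sanity check of (5) and (6), the tensor can be generated directly from the determinant expansion used in the proof of Theorem 3. The following Python sketch is illustrative only: the function names, the sample motion parameters, the object point and the way the bearings are synthesised (any line through the transformed point and an arbitrarily chosen camera centre) are assumptions made for the example and are not taken from the paper.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def platform_matrix(a, b, c, d):
    """Transformation S = [[a, b, c], [-b, a, d], [0, 0, 1]]."""
    return [[a, b, c], [-b, a, d], [0.0, 0.0, 1.0]]

def trilinear_tensor(S2, S3):
    """T[p][q][r] = det(e_p, row q of S2, row r of S3), assuming S1 = I (0-based indices)."""
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    return [[[det3([identity[p], S2[q], S3[r]]) for r in range(3)]
             for q in range(3)] for p in range(3)]

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def transform(S, U):
    return [sum(S[i][k] * U[k] for k in range(3)) for i in range(3)]

def synthetic_bearing(S, U, centre):
    """Hypothetical bearing: the line joining S U and an assumed camera centre."""
    return cross(transform(S, U), centre)

def trilinear_residual(T, l1, l2, l3):
    """Left-hand side of the trilinear constraint (5)."""
    return sum(T[p][q][r] * l1[p] * l2[q] * l3[r]
               for p in range(3) for q in range(3) for r in range(3))

# Arbitrary sample values, chosen only for illustration.
a2, b2, c2, d2 = 0.8, 0.6, 1.0, -2.0
a3, b3, c3, d3 = 0.6, -0.8, -1.5, 0.5
S1 = platform_matrix(1.0, 0.0, 0.0, 0.0)
S2 = platform_matrix(a2, b2, c2, d2)
S3 = platform_matrix(a3, b3, c3, d3)
T = trilinear_tensor(S2, S3)

U = [2.0, 3.0, 1.0]  # homogeneous object point
setups = ((S1, [0.0, 0.0, 1.0]), (S2, [1.0, 0.0, 1.0]), (S3, [0.0, 1.0, 1.0]))
bearings = [synthetic_bearing(S, U, centre) for S, centre in setups]
residual = trilinear_residual(T, *bearings)
```

With noise-free bearings the residual vanishes up to rounding, and individual entries such as `T[0][0][2]` (that is, $T_{113}$) can be compared against the closed forms in (6).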



Note that the tensors have a number of zero components. It can be shown
that there are 15 linearly independent linear constraints on the tensor
components. These are

$$\begin{cases}
T_{133} = T_{233} = T_{313} = T_{323} = T_{331} = T_{332} = T_{333} = 0, \\
T_{131} - T_{232} = T_{132} + T_{231} = T_{113} - T_{223} = 0, \\
T_{123} + T_{213} = T_{311} - T_{322} = T_{321} + T_{312} = 0, \\
T_{111} - T_{122} - T_{212} - T_{221} = 0, \\
T_{112} + T_{121} + T_{211} - T_{222} = 0.
\end{cases} \qquad (7)$$
There are also four non-linear constraints, i.e.

$$T_{312}(T_{123}T_{131} + T_{231}T_{223}) + T_{311}(T_{123}T_{231} - T_{131}T_{223}) = 0, \qquad (8)$$

$$\begin{aligned}
&-T_{231}T_{223}T_{131}T_{221} + T_{223}T_{231}T_{231}T_{121} - T_{123}T_{231}T_{231}T_{111} \\
&+ T_{231}T_{223}T_{131}T_{111} - T_{123}T_{131}T_{131}T_{221} + T_{123}T_{131}T_{231}T_{121} \\
&+ T_{123}T_{131}T_{231}T_{211} - T_{223}T_{131}T_{131}T_{211} = 0,
\end{aligned} \qquad (9)$$

$$\begin{aligned}
&T_{131}T_{123}T_{222} - T_{131}T_{121}T_{123} - T_{131}T_{123}T_{211} + T_{231}T_{111}T_{123} \\
&- T_{131}T_{122}T_{223} - T_{231}T_{121}T_{223} = 0,
\end{aligned} \qquad (10)$$

$$\begin{aligned}
&T_{231}T_{123}T_{123}T_{222} - T_{131}T_{123}T_{123}T_{221} - T_{223}T_{131}T_{123}T_{222} \\
&- T_{123}T_{231}T_{122}T_{223} - T_{123}T_{231}T_{221}T_{223} + T_{131}T_{122}T_{223}T_{223} \\
&+ T_{231}T_{121}T_{223}T_{223} + T_{223}T_{131}T_{121}T_{123} = 0.
\end{aligned} \qquad (11)$$

If only Euclidean transformations of the platform are allowed, which is
reasonable, there are two additional constraints

$$(T_{123})^2 + (T_{113})^2 = a_2^2 + b_2^2 = 1, \qquad (12)$$

$$(T_{231})^2 + (T_{232})^2 = a_3^2 + b_3^2 = 1. \qquad (13)$$
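The constraints can be checked numerically against the parameterisation (6). The Python sketch below is a hedged illustration, not the authors' code: it builds a tensor from arbitrarily chosen Euclidean motion parameters (rotation angles and translations are assumptions) and evaluates the residuals of the fifteen linear constraints (7), the four non-linear constraints (8) to (11) and the two normalisation constraints (12) and (13), with signs fixed so that each expression vanishes identically under (6).

```python
import math

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def tensor(a2, b2, c2, d2, a3, b3, c3, d3):
    """Dictionary T[pqr], e.g. T[123], built from det(e_p, row q of S2, row r of S3)."""
    e = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    S2 = [[a2, b2, c2], [-b2, a2, d2], [0.0, 0.0, 1.0]]
    S3 = [[a3, b3, c3], [-b3, a3, d3], [0.0, 0.0, 1.0]]
    return {100 * (p + 1) + 10 * (q + 1) + (r + 1): det3([e[p], S2[q], S3[r]])
            for p in range(3) for q in range(3) for r in range(3)}

def constraint_residuals(T):
    linear = [T[133], T[233], T[313], T[323], T[331], T[332], T[333],
              T[131] - T[232], T[132] + T[231], T[113] - T[223],
              T[123] + T[213], T[311] - T[322], T[321] + T[312],
              T[111] - T[122] - T[212] - T[221],
              T[112] + T[121] + T[211] - T[222]]
    c8 = (T[312] * (T[123] * T[131] + T[231] * T[223])
          + T[311] * (T[123] * T[231] - T[131] * T[223]))
    c9 = (-T[231] * T[223] * T[131] * T[221] + T[223] * T[231] ** 2 * T[121]
          - T[123] * T[231] ** 2 * T[111] + T[231] * T[223] * T[131] * T[111]
          - T[123] * T[131] ** 2 * T[221] + T[123] * T[131] * T[231] * T[121]
          + T[123] * T[131] * T[231] * T[211] - T[223] * T[131] ** 2 * T[211])
    c10 = (T[131] * T[123] * T[222] - T[131] * T[121] * T[123]
           - T[131] * T[123] * T[211] + T[231] * T[111] * T[123]
           - T[131] * T[122] * T[223] - T[231] * T[121] * T[223])
    c11 = (T[231] * T[123] ** 2 * T[222] - T[131] * T[123] ** 2 * T[221]
           - T[223] * T[131] * T[123] * T[222] - T[123] * T[231] * T[122] * T[223]
           - T[123] * T[231] * T[221] * T[223] + T[131] * T[122] * T[223] ** 2
           + T[231] * T[121] * T[223] ** 2 + T[223] * T[131] * T[121] * T[123])
    c12 = T[123] ** 2 + T[113] ** 2 - 1.0
    c13 = T[231] ** 2 + T[232] ** 2 - 1.0
    return linear + [c8, c9, c10, c11, c12, c13]

# Arbitrary Euclidean motions (angles and translations are illustrative assumptions).
theta2, theta3 = 0.3, -1.1
T = tensor(math.cos(theta2), math.sin(theta2), 1.5, -0.7,
           math.cos(theta3), math.sin(theta3), -2.0, 0.4)
residuals = constraint_residuals(T)
```

All 21 residuals should be zero up to floating-point rounding for any choice of angles and translations.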



These two last constraints are true only if the tensors are considered to be 
normalised with respect to scale. It is straightforward to generate corresponding 
non-normalised (and thus also homogeneous) constraints. 

It is natural to think of the tensor as being defined only up to scale. Two
tensors $T$ and $\tilde{T}$ are considered equal if they differ only by a scale factor,

$$T \sim \tilde{T}.$$

Let $\mathcal{T}$ denote the set of equivalence classes of trilinear tensors fulfilling equations
(7)-(13).

Definition 1. Let the manifold of relative motion of three platform positions
be defined as the set of equivalence classes of three transformations

$$M_S = \left\{ (S_1, S_2, S_3) \;\middle|\; S_i = \begin{pmatrix} a_i & b_i & c_i \\ -b_i & a_i & d_i \\ 0 & 0 & 1 \end{pmatrix} \right\},$$

where the equivalence is defined as

$$(S_1, S_2, S_3) \simeq (\tilde{S}_1, \tilde{S}_2, \tilde{S}_3), \quad \exists S \in \mathcal{S},\; S_i \sim \tilde{S}_i S, \quad i = 1, 2, 3.$$



Thus the above discussion states that the map $(S_1, S_2, S_3) \mapsto T$ is in fact a
well defined map from the manifold of equivalence classes $M_S$ to $\mathcal{T}$.

Theorem 4. A tensor $T_{pqr}$ is a calibrated trilinear tensor if and only if
equations (7)-(13) are fulfilled. When these constraints are fulfilled it is possible to
solve (6) for $S_2$, $S_3$. The solution is in general unique.



Corollary 1. The map

$$T : M_S \to \mathcal{T}$$

is a well defined one-to-one mapping.
5.2 Algorithm

The previous section on the calibrated trilinear tensor has provided us with the 
tool for solving the structure and motion problem for three platform positions 
of at least eight points. 

Algorithm 5.1 (Structure and Motion from Three Platform Motions)

1. Given three images of at least six points,

$$\mathbf{u}_{ij}, \quad i = 1, \ldots, 3, \quad j = 1, \ldots, n, \quad n \geq 6.$$

2. Calculate all possible trilinear tensors $T$ that fulfil the constraints (7) to
(13) and $\sum_{p,q,r} T_{pqr}\, l_{1j,p}\, l_{2j,q}\, l_{3j,r} = 0$, $\forall j = 1, \ldots, n$.

3. Calculate the platform motions $(S_1, S_2, S_3)$ from $T$ according to the proof of
Theorem 4.

4. For each solution to the motion calculate structure using intersection.
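Step 4 can be illustrated with a small sketch. Once the motions are known, each bearing gives one row $\mathbf{l}_{ij}^T S_i$ of the matrix $M$ from the proof of Theorem 3, and the object point spans its null space. The Python code below is a minimal noise-free illustration under stated assumptions: the helper names, the sample motions and the synthetic bearings are hypothetical, and with noisy data a least-squares null vector would be needed instead of a cross product of two rows.

```python
def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def transform(S, U):
    return [sum(S[i][k] * U[k] for k in range(3)) for i in range(3)]

def row_times_matrix(l, S):
    """Row vector l^T S."""
    return [sum(l[k] * S[k][j] for k in range(3)) for j in range(3)]

def intersect(S_list, bearings):
    """Intersect three lines l_i^T S_i U = 0; returns the point and a consistency residual."""
    rows = [row_times_matrix(l, S) for l, S in zip(bearings, S_list)]
    U = cross(rows[0], rows[1])          # null vector of the first two rows
    if abs(U[2]) < 1e-12:
        raise ValueError("lines are parallel or the point is at infinity")
    point = (U[0] / U[2], U[1] / U[2])
    residual = sum(r * u for r, u in zip(rows[2], U)) / U[2]  # third line should also pass
    return point, residual

# Illustrative data (assumed values): three platform positions and one object point.
S1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
S2 = [[0.8, 0.6, 1.0], [-0.6, 0.8, -2.0], [0.0, 0.0, 1.0]]
S3 = [[0.6, -0.8, -1.5], [0.8, 0.6, 0.5], [0.0, 0.0, 1.0]]
U_true = [2.0, 3.0, 1.0]
centres = ([0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0])  # assumed camera centres
bearings = [cross(transform(S, U_true), c) for S, c in zip((S1, S2, S3), centres)]

point, residual = intersect([S1, S2, S3], bearings)
```

A residual far from zero on real data would indicate that the three bearings do not meet in a single point for the tested motion hypothesis.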



5.3 Homotopy Studies of Six Points in Three Stations 

In step 2 of the above algorithm one has to find all solutions to a system of
polynomial equations. We have not yet solved this system in detail, but rather
experimented with simulated data and a numerical solver for such systems of
equations.

When studying the equations of six points and three stations in two dimensions
under the Euclidean assumption ($a_i^2 + b_i^2 = 1$) in the dual points formulation
there are 8 equations and 8 unknowns. By inserting these equations into the
homotopy software PHC-pack [Ver99] it is found that the mixed volume [CLO98]
is 39 and there are up to 25 real solutions.

Based on these experimental investigations we postulate that there are in 
general up to 25 real solutions to the problem of 6 points in 3 images. 

6 Conclusions 

In this paper we have introduced the structure and motion problem for the 
notion of a platform of moving cameras. Three particular cases, (i) eight points 
in three one-dimensional views, (ii) three points in two two-dimensional views 
and (iii) two points in three two-dimensional views have been studied. Solutions 
to these problems are useful for structure and motion estimation of autonomous 
vehicles equipped with multiple cameras. 

The existence of a fast solver for the two images and three points case in three 
dimensions is of interest when computing RANSAC. It is important to note that 
pure translation is a degenerate case and that the solution in this case suffers 
from the same unknown scale as for single camera solutions. Another important 
aspect is that the cameras have to have separate focal points.
Abstract. A detection and tracking approach is proposed for line
scratch removal in a digital film restoration process. Unlike random
impulsive distortions such as dirt spots, line scratch artifacts persist across
several frames. Hence, motion compensated methods will fail, as will
single-frame methods if scratches are unsteady or fragmented.

The proposed method uses as input projections of each image of the
input sequence. First, a 1D-extrema detector provides candidates. Next,
an MHT (Multiple Hypothesis Tracker) uses these candidates to create
and keep multiple hypotheses. As the tracking goes further through the
sequence, each hypothesis gains or loses evidence. To avoid a
combinatorial explosion, the hypothesis tree is sequentially pruned, preserving
a list of the best ones. An energy function (quality of the candidates,
comparison to a model) is used for the path hypothesis sorting. As
hypotheses are set up at each iteration, even if no information is available, a
tracked path might cross gaps (missed detection or speckled scratches).
Finally, the tracking stage feeds the correction process. Since this
contribution focuses on the detection stage, only tracking results are given.



1 Introduction 

Despite the fast-growing use of digital media, photochemical film is still the
storage base in the motion picture industry and several million reels are stored
at film archives. Film is a good medium for long term storage, but future
mass-migration to digital media is inevitable, and digital processing at this step
could ensure the removal of the various typical film-related damages, see
figure 1. Though traditional restoration techniques are necessary (the film should
be able to mechanically withstand the digitisation step), digital restoration lets
us expect results beyond today’s limitations (automated processing, correction
of previously photographed artifacts, etc.).

Digital restoration has only very recently been explored [1,2,3]. The main
visual defects are dust spots, hairs and dirt, instabilities (both exposure and
position) and scratches. Some of them are now easily detected and removed,
especially if the defect appears only in a single frame. This is not the case for
scratches. Scratches are mainly vertical (parallel to the film transport direction),
Fig. 1. Film structure and film damage



caused by slippage and abrasion during fast starts, stops and rewinding. Because 
the scratch is spread over many frames, and appears at the same location during 
projection, this damage is readily seen by the viewer, and also difficult to detect 
and correct using image processing. 1 

Early work on digital line scratch removal goes back to Anil C.
Kokaram’s research activities [4,5,6]. His detection scheme, based on the vertical
mean, is still used today. Other approaches use vertical projections and local
maxima or minima detection. Bretschneider et al. [7,8] suggest a wavelet
decomposition using the low frequency image and the vertical components for a fast
detection. Some recent work [9,10] improves on Kokaram’s approach, but most of
the techniques are intra-frame methods, neglecting scratch tracking [11].

In our approach, we consider a large number of test sequences (old footage 
and new shoots). We state that a tracking mechanism considerably increases the 
detection quality because line scratches can be very unsteady. The x-position of 
the line scratch can move sideways up to 10 % of the image width (see figure 2). 
Consequently, the intra-frame shape of the scratch is not perfectly vertical and 
the corresponding slope might reach 5 degrees. All the methods based on full 
frame projection or vertical mean fail in this case. 

Tracking also improves the detection, essentially in noisy images. The
localisation of the detected scratch is better, and therefore so is its correction.
Finally, since a line scratch has a continuous life over many frames, our
method allows inter-frame tracking in order to assign a unique identifier to
the detected scratch for its entire lifetime. This is important for our user interface
(selection on a per-scratch basis instead of a per-frame basis).

The present work deals essentially with persistent line scratches (several con- 
secutive frames). Other methods based on motion compensation and temporal 
discontinuity in image brightness are more suitable for short line scratches (ap- 

1 Proper film digitisation requires a wet gate process, which dramatically reduces 
visible scratches. In a wet gate, a liquid (perchloroethylene) fills the gaps, and most 
of fine scratches are no longer visible. But the wet gate process requires the use of 
chemicals and is not very compatible with a high digitisation throughput.
Fig. 2. Image exhibiting a notably slanted scratch, from the “maree noire, colere rouge”
documentary and tracked path over 9 consecutive frames. Shown image is the 2nd one.



pearing randomly on a single frame only), as well as dust spots. Those methods 
fail with persistent scratches, present on the previous/next frame at nearly the 
same position and consequently matched and labelled as part of the scene. 



2 Pre-processing : Image Projection 



Though we cannot assume line scratches to be vertical over a whole image, this
hypothesis is locally true for a few consecutive horizontal lines in the original
image $I$. We consider that the scratch abscissa is locally constant over a band of $H$
lines of $I$. Several advantages lead us to work with an image $P$, the vertically
sub-sampled projection of the original image $I$. Each line of $P$ is the vertical
mean value of $H$ lines of $I$ (we call $H$ the projection height):



$$P(x, y) = \frac{1}{H} \sum_{i=0}^{H-1} I(x,\; y \times H + i) \qquad (1)$$



— The amount of data (and processing time) is reduced by a factor of $H$.

— Noise is reduced by a factor of $\sqrt{H}$ if Gaussian.

— Line scratch intensity remains unaltered (assumed constant over $H$ lines).
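The projection of equation (1) can be sketched in a few lines of Python. This is a minimal illustration that assumes the image is stored as a list of rows (so that `image[y][x]` plays the role of $I(x, y)$) and that any incomplete band at the bottom of the image is simply dropped; neither choice is specified in the text.

```python
def project(image, H):
    """Vertically sub-sampled projection: P[y][x] = mean of image[y*H + i][x], i = 0..H-1."""
    if H <= 0:
        raise ValueError("projection height H must be positive")
    bands = len(image) // H          # incomplete trailing band is ignored (assumption)
    width = len(image[0])
    return [[sum(image[y * H + i][x] for i in range(H)) / H for x in range(width)]
            for y in range(bands)]

# Tiny example: 4 image lines projected with H = 2 give 2 projected lines.
image = [[1, 2, 3],
         [3, 4, 5],
         [10, 10, 10],
         [20, 30, 40]]
P = project(image, 2)
```

Each projected line holds the column-wise mean of one band, so a scratch whose abscissa is constant over the band keeps its intensity while uncorrelated noise is averaged down.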

This simple method gives very good results, though more complex projection
schemes may be used, for example overlapping bands or a weighted mean. Let
us emphasise that the $H$ parameter is of primary importance, because it will
impact all the remaining processing steps and determine the maximum detectable
scratch slope $q$. Above this maximum slope, scratches become attenuated after
projection. $H$ can be determined with respect to $q$ by the following relation:
$H = 1/\tan(q)$; for $q = 5$ degrees, we have $H = 12$ pixels. According to image size, we
use $H = 8$, $H = 12$ or $H = 16$ (exact divisors). Figure 3 illustrates the projection
transform.
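The choice of $H$ can be worked through numerically. The sketch below assumes that the relation between the projection height and the maximum slope is $H = 1/\tan(q)$, which gives roughly 11.4 for 5 degrees and is consistent with the quoted value of about 12 pixels; the function names and the 1080-line test height are illustrative assumptions.

```python
import math

def max_projection_height(q_degrees):
    """Band height at which a scratch of slope q drifts by one pixel (assumed H = 1/tan q)."""
    return 1.0 / math.tan(math.radians(q_degrees))

def usable_heights(image_height, candidates=(8, 12, 16)):
    """Candidate projection heights that divide the image height exactly."""
    return [h for h in candidates if image_height % h == 0]

h_for_5_degrees = max_projection_height(5.0)   # about 11.43
video_choices = usable_heights(576)            # video resolution mentioned in the text
```

Picking the largest exact divisor that does not exceed the slope limit trades noise reduction against attenuation of slanted scratches.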
Fig. 3. Image from the “lost world” movie and projected image P(x, y) for 7 consecutive
frames. The projection height H is 16. The scratches (dark one on the left side, bright 
one and dark one in the middle) are still visible. 



3 Line Scratch Candidates Selection 

The next step is the extraction of candidates which are used as input in the
tracking process. The typical spatial signature of a scratch is a local extremum
of the intensity curve along the x axis. So candidate pixels should be local
maxima or minima horizontally, to find bright or dark scratches respectively. Many
different methods exist in the literature to achieve this detection, and we
experimented with several of them [12]. For this work, we want our candidate extractor to
meet the following requirements:

— Generate signed output : positive for bright scratches, negative for dark ones. 

— Give a quality measure for each candidate, not only a simple binary output. 

— Normalise quality measure between some known bounds, typically {−1, +1}.

The method we use relies on greyscale morphology [13,14]. Candidates for a
line scratch are extracted by computing the difference between the original
image and its opening or closing with a structuring element $B_w$. The opening will
Fig. 4. The image P(x,y) (left side) is computed from a 36-frame (1.5 second) sequence.
The output Q(x,y) (right side) is here shown as greyscale image ; real output values 
are signed so the tracker cannot confuse bright scratch candidates and dark ones. 



remove thin structures brighter than the background, while closing will remove
thin structures darker than the background. This way, to extract bright
candidates, we subtract from $P$ its opening with $B_w$, and symmetrically candidates
for dark scratches are defined as the difference between the pixel values in $P$ and
their closing with $B_w$:

$$D^+(x, y) = P(x, y) - ((P(x, y) \ominus B_w) \oplus B_w). \qquad (2)$$

$$D^-(x, y) = ((P(x, y) \oplus B_w) \ominus B_w) - P(x, y). \qquad (3)$$

— $D^\pm(x, y)$ is the difference between the greyscale pixel value being considered
and a spatial neighbourhood of width $w$.

— $\oplus$ stands for morphological dilation, $\ominus$ for morphological erosion and $B_w$
is an unconstrained 1-D structuring element of width $w$.
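Equations (2) and (3) amount to the white and black top-hat transforms of one projected line. The following Python sketch illustrates them under the assumption that $B_w$ is a flat (all-zero) structuring element of odd width $w$ and that windows are truncated at the image border; the function names are hypothetical.

```python
def erode(row, w):
    """Flat 1-D greyscale erosion: minimum over a window of width w (truncated at borders)."""
    r = w // 2
    return [min(row[max(0, x - r):x + r + 1]) for x in range(len(row))]

def dilate(row, w):
    """Flat 1-D greyscale dilation: maximum over a window of width w (truncated at borders)."""
    r = w // 2
    return [max(row[max(0, x - r):x + r + 1]) for x in range(len(row))]

def top_hats(row, w):
    """Return (D_plus, D_minus) for one line of P, following equations (2) and (3)."""
    opened = dilate(erode(row, w), w)    # opening removes thin bright structures
    closed = erode(dilate(row, w), w)    # closing removes thin dark structures
    d_plus = [p - o for p, o in zip(row, opened)]
    d_minus = [c - p for c, p in zip(closed, row)]
    return d_plus, d_minus

bright_line = [10, 10, 10, 50, 10, 10, 10]   # one-pixel bright scratch
d_plus, d_minus = top_hats(bright_line, 3)
```

A one-pixel bright scratch is removed by the opening, so it shows up in `d_plus` only, while a dark scratch of the same width would show up in `d_minus`.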

Because line scratches can be poorly contrasted relatively to their back- 
ground, whereas natural image structures generally show a much stronger re- 
sponse, we locally normalise the result, to consider the significance of the ex- 
tremum with respect to its spatial neighbourhood, using the following formula : 




Detection and Tracking Scheme for Line Scratch Removal 269 



$$Q(x, y) = \begin{cases} A \times \dfrac{D^+(x, y) - D^-(x, y)}{(P(x, y) \oplus B_w) - (P(x, y) \ominus B_w)} & \text{if } ((P(x, y) \oplus B_w) - (P(x, y) \ominus B_w)) > s, \\[2ex] 0 & \text{else.} \end{cases}$$



— $Q(x, y)$ stands for the output image of this detector. This image is signed;
positive values stand for local maxima and negative values for local minima.
The tracking stage will use this image as input.

— $(P(x, y) \oplus B_w) - (P(x, y) \ominus B_w)$ is the local contrast.

— $s$ is a threshold (see below).

— $A$ is a scaling factor, which determines the output value range $[-A \ldots +A]$. We
typically use $A = 127$, to store $Q(x, y)$ as an 8-bit greyscale image.
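The normalised detector can be sketched for a single projected line as follows. This is a hedged illustration, not the authors' implementation: it assumes a flat structuring element of odd width $w$ with windows truncated at the borders, and the function and variable names are invented for the example.

```python
def window_min(row, w):
    r = w // 2
    return [min(row[max(0, x - r):x + r + 1]) for x in range(len(row))]

def window_max(row, w):
    r = w // 2
    return [max(row[max(0, x - r):x + r + 1]) for x in range(len(row))]

def detect_candidates(row, w, s, A=127):
    """Signed, contrast-normalised quality Q for one line of the projected image P."""
    eroded, dilated = window_min(row, w), window_max(row, w)
    opened = window_max(eroded, w)       # opening = erosion followed by dilation
    closed = window_min(dilated, w)      # closing = dilation followed by erosion
    q = []
    for x, p in enumerate(row):
        contrast = dilated[x] - eroded[x]            # local contrast
        if contrast > s:
            d_plus, d_minus = p - opened[x], closed[x] - p
            q.append(A * (d_plus - d_minus) / contrast)
        else:
            q.append(0.0)                            # smooth area: no candidate
    return q

q_bright = detect_candidates([10, 10, 10, 50, 10, 10, 10], w=3, s=5)
q_dark = detect_candidates([50, 50, 50, 10, 50, 50, 50], w=3, s=5)
q_flat = detect_candidates([20, 20, 20, 20], w=3, s=5)
```

Because the threshold test is applied before the division, a perfectly flat neighbourhood never causes a division by zero as long as $s \geq 0$.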

The major tuning parameters are $w$ and $s$. $w$ defines the maximum scratch
width, and the size of the neighbourhood used to normalise the output. It strongly
depends on the input image resolution. We obtained satisfactory results with
5 < w < 9 at video resolution (720 x 576), and 9 < w < 13 at high resolution
(2048 x 1536). The threshold $s$ has two goals: to reduce the number of false alarms,
and to inhibit candidate extraction in smooth areas. It controls the sensitivity and is
usually set to some low value, but is still required to eliminate spurious candidates.
Figure 4 shows the result of a candidate detection.



4 Tracking Problem Formulation 

4.1 General Background 

After the pre-processing and detection stage, we still have to track the scratch
candidates over the sequence. A human observer will easily locate the most
visible scratches in figure 4, but the visual localisation of incomplete ones requires
more concentration. An automated tracking system is likely to be fooled by false
alarms too, especially if vertical structures are present in the image.

The proposed tracking scheme should be able to distinguish real scratches
from false alarms, to close the gaps caused by detection failures or
discontinuous scratches, to find the optimum path through candidates and also to uniquely
identify the scratches (the detection process will assign a unique ID to each
scratch, ranging from the frame where it appears to the frame where it vanishes).

The input of our tracking scheme is the image $Q(x, y)$. The lines are read
sequentially, from the top to the bottom, so that the temporal axis matches
the y axis of this image. Each line is a data set $Z_t$ for the tracker
(observation). In fact, such a representation (figure 4, right side) is very similar to a
radar (or sonar) plot (echoes vs. time). Tracking schemes for such applications
are common in the literature, and several approaches exist: Kalman filtering,
AR methods, probabilistic data association filter (PDAF), multiple hypothesis
trackers (MHT), Monte-Carlo and particle filters, etc.; see an overview in [15].
Our problem is even simpler, since only one parameter should be estimated :
Fig. 5. Basis structure of trellis diagram built for each track. Since the representation
space is matched against the state space, the possible transitions from state $X_{t-1}$ to
$X_t$ are linked to a one-pixel deviation from the x position attached to state $X_{t-1}$.



the scratch localisation on the x-axis. Therefore, the state space and the world 
model (or representation space) are tightly matched. 

Kalman filtering or PDAF approaches combine the different hypotheses at
each step, while the MHT multiple hypothesis scheme keeps multiple hypotheses
alive [16]. The idea is that by getting more observations $Z_t$, received at time $t$,
and matching these to the hypotheses, the hypothesis corresponding to a real
scratch path will gain more evidence, making it distinguishable from false ones.
Besides, the hypothesis for a not perfectly continuous scratch will not disappear
too quickly. Another advantage of such an approach is to unify in one concept
path initialisation, path tracking and path decay.

Of course, the challenge is to maintain a reasonable number of hypotheses,
according to available memory and computing power. Rejecting improbable
hypotheses or keeping the best one at each stage are possible approaches. Since the
whole sequence could be digitised prior to the processing, an exhaustive search
of the optimum path for each scratch is also possible (although not reasonable);
but our implementation aims to process data on the fly and can therefore be
used in near-real time systems, gathering the images as output by a telecine.



4.2 Path Hypothesis Generation 

The path hypothesis generation consists in building a trellis diagram in the state
space. In this trellis diagram, we find possible states $X_t$ at time $t$, and branches
representing transitions from one state at time $t$ to the next at time $t + 1$.
The tracking process can follow simultaneously multiple scratches, but to keep
the algorithm simple we will use a different trellis for each track, and consider
simultaneous tracks as independent (accordingly, track merging is impossible).
A particular path hypothesis through the trellis is defined as a set of
sequential states $X_{t_0}, X_{t_1}, \ldots, X_{t-1}, X_t$, $t_0$ being the starting time (initialisation) for
the track we consider. Sequentially, as shown in figure 5, each state at time $t$ is
linked with only 3 states at time $t - 1$, and conversely. This is due to the fact
that we tolerate only, after projection, a horizontal displacement of one pixel
between two consecutive lines for a track. As a consequence, the total number
of possible paths for a particular track, at time $t$, is $3^t$.

A new trellis for a new track (holding one state $X_{t_0} = x$ and one path
hypothesis) is generated if, for a given data set $Z_t$ (a line taken from the image
$Q(x, y)$), an unmatched, isolated but relevant candidate is found.



4.3 Path Hypothesis Update 

As stated earlier, the challenge is to prune the hypothesis tree, which grows
even if the new data set does not contain high detection values. Since the path
hypotheses are represented as a trellis diagram in the state space, a practicable
approach consists in weighting each transition from a state at stage $m - 1$ to a
state at stage $m$. The well-known Viterbi algorithm can be used to sequentially
prune the paths. All paths kept at the previous stage are extended to the possible
states at the next stage, and the best path leading to each state is selected. Other
paths are eliminated from further consideration. The Viterbi algorithm can be
used in its standard form when the transition costs between states depend only
on the previous state and the current measurements. If this condition is not met
(non-Markovian) then the kept path might be sub-optimum.
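The pruning step of the standard Viterbi algorithm on this trellis can be sketched as follows. The example is illustrative only and makes simplifying assumptions: the score of a state is taken to be the candidate quality at that position (so the Markov condition holds), one survivor is kept per state, and the function name and test data are hypothetical.

```python
def viterbi_track(scores):
    """Best path through a trellis with transitions x-1, x, x+1, maximising the summed scores.

    scores[t][x] is the score of state x at time t. Returns (path, total_energy).
    """
    n_x = len(scores[0])
    best = list(scores[0])               # best cumulative energy ending in each state
    back_pointers = []
    for t in range(1, len(scores)):
        new_best, pointers = [], []
        for x in range(n_x):
            predecessors = [p for p in (x - 1, x, x + 1) if 0 <= p < n_x]
            p_star = max(predecessors, key=lambda p: best[p])   # survivor for state x
            new_best.append(best[p_star] + scores[t][x])
            pointers.append(p_star)
        best = new_best
        back_pointers.append(pointers)
    x = max(range(n_x), key=lambda i: best[i])
    total = best[x]
    path = [x]
    for pointers in reversed(back_pointers):   # backtrack through the survivors
        x = pointers[x]
        path.append(x)
    path.reverse()
    return path, total

scores = [[0, 9, 0, 0],
          [0, 0, 8, 0],
          [0, 0, 0, 7],
          [0, 0, 6, 0]]
path, energy = viterbi_track(scores)
```

Only one survivor per state is stored, which is why a history-dependent transition cost can make the retained path sub-optimal.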








Fig. 6. These figures illustrate a possible path through the state space, and therefore
the representation space $Q(x, y)$ (see text). Each line of $Q(x, y)$ is used as observation
$Z_t$ for the tracker. To prune the paths and keep the L-best ones, the likelihood of a
track is measured by a cost or energy function, based on both the quality of candidates
(left figure) and closeness to an estimated path using short term history for parameter
estimation (dotted line in the right figure). The cost computation is done for each path
in the L-list, and for each possible new state.
To overcome this behaviour, we use the list-Viterbi algorithm (L-Viterbi),
keeping a list of the L most valuable paths at each state and for each stage
[17,18]. We sequentially prune paths which are unlikely, and choose L so that no
valid path is eliminated. The risk that an optimum path is rejected decreases
as L increases. As in the Viterbi algorithm, a value is computed along a path
(a cost function if the value should be minimised, otherwise an energy function). While the
standard Viterbi algorithm uses a plausibility value to score the transitions from state
to state, we give below details on our implementation.

The set of possible paths at time $t$ is denoted $C_t$, with $C_{t,i}$ the $i$-th possible path
at this time. To each branch in the trellis we assign a cost or energy function
$E_{t,i}$ which depends on the path $C_{t,i}$ being considered, and for each path $C_{t,i}$ we
define $E(C_{t,i})$ as the cumulative energy of its path branches:

$$E(C_{t,i}) = \sum_{n=0}^{t} E_{n,i}. \qquad (4)$$

Our tracking problem can now be summarized as finding the L-optimal paths
through this trellis, maximising the energy function $E$. For the path $C_{t,i}$ the
energy assigned to the branch linking the state $X_{t-1,i}$ to $X_{t,i}$ is:

$$E_{t,i} = |Q(X_{t,i})| - W \cdot (X_{t,i} - \hat{X}_{t,i})^2. \qquad (5)$$

The first term $Q(X_{t,i})$ is the quality criterion of the candidate associated with
the state $X_{t,i}$. Using it as part of the cost function is quite obvious, since a line
scratch is defined as a set of sequential local extrema (candidates) extracted from
$P(x, y)$. So a path should maximise the number of candidates with a strong
quality criterion. The sign of $Q(x, y)$ is used as a toggle to prevent mixing “bright”
candidates and “dark” candidates, but $|Q(x, y)|$ is used in the energy function.

The second term $(X_{t,i} - \hat{X}_{t,i})^2$ is the squared difference between $X_{t,i}$ and $\hat{X}_{t,i}$,
an estimate using the state history on path $C_{t,i}$. This is a tension constraint, a
basic line scratch model, which will prevent paths which are not rigid enough
from being chosen. The physical behaviour (inertia) of line scratches is reflected by
this model. This constraint will prevent a path from locking on isolated extrema,
especially when no more valid candidates seem available. The model used for the
$\hat{X}_{t,i}$ estimate is a second-order polynomial, which is enough for our requirements:

$$\hat{x}(t) = \sum_{i=0}^{2} m_i\, t^i. \qquad (6)$$

We estimate the polynomial coefficients using a least squares method, on the $N$
previous states of the path, starting from $X_{t-N+1,i}$. $N$ must be high enough to prevent
model divergence, and low enough to fit the local trajectory well. A Kalman filter
was considered for this task in earlier work [19].
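The model fit and the branch energy of equation (5) can be sketched together. The Python code below is a minimal illustration under stated assumptions: the quadratic is fitted by solving the 3 × 3 normal equations with Cramer's rule, the helper names are hypothetical, and the weight `W = 2.0` in the example is an arbitrary value chosen for the demonstration, not the value used by the authors.

```python
def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_quadratic(ts, xs):
    """Least-squares coefficients [m0, m1, m2] of x(t) = m0 + m1 t + m2 t^2."""
    A = [[float(sum(t ** (j + k) for t in ts)) for k in range(3)] for j in range(3)]
    b = [float(sum(x * t ** j for t, x in zip(ts, xs))) for j in range(3)]
    d = det3(A)
    if abs(d) < 1e-12:
        raise ValueError("at least three distinct time stamps are needed")
    coeffs = []
    for k in range(3):                   # Cramer's rule, column k replaced by b
        Ak = [row[:] for row in A]
        for j in range(3):
            Ak[j][k] = b[j]
        coeffs.append(det3(Ak) / d)
    return coeffs

def predict(coeffs, t):
    """Evaluate the fitted polynomial at time t."""
    return sum(m * t ** i for i, m in enumerate(coeffs))

def branch_energy(q, x, x_hat, W):
    """Energy of one branch: |Q| minus the weighted squared deviation from the model."""
    return abs(q) - W * (x - x_hat) ** 2

history_t = [0, 1, 2, 3]
history_x = [2.0, 2.5, 4.0, 6.5]         # samples of 2 + 0.5 t^2
coeffs = fit_quadratic(history_t, history_x)
x_hat = predict(coeffs, 4)
energy = branch_energy(q=-100, x=11, x_hat=x_hat, W=2.0)
```

A strong candidate far from the extrapolated trajectory is penalised quadratically, which is what keeps the path from jumping to isolated extrema.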

Finally, $W$ is a scaling factor, used to control the respective influence of both
contributions (and so the rigidity of estimated tracks). It is strongly dependent
on the parameter $A$ used in the pre-processing stage. We obtained good results
with W = and choosing N according to the projected image height. 
4.4 Track Ending Condition

As the update process handles the hypothesis tree pruning, the tracking
mechanism is kept running until the track’s ending condition is reached. If we introduced
predominant history-related factors in the ending condition computation, long
scratches would be kept alive while the ending condition would be quickly reached for short
ones. So, only the quality values of candidates are used. The mean value of the
quality values $Q(x, y)$ associated with the most recent states of the best path is
computed. The tracking is suspended if this value falls below a threshold; we
then suppose the end of the scratch is reached. The path is stored in an intermediate
data file for the subsequent removal processing, and the trellis representation is
cleared from memory. This ending condition causes the tracking to overshoot
the real scratch end. It could be improved by searching for the strongest negative
variation of Q(x,t) along the best path once the ending condition is met.
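The ending test can be sketched in a few lines. This is an illustrative sketch with assumed details: the quality values are taken in absolute value (since $Q$ is signed), and the window length and threshold in the example are arbitrary values, as the text does not state them.

```python
def track_ended(qualities, window, threshold):
    """True when the mean |Q| over the last `window` states of the best path is below threshold."""
    recent = [abs(q) for q in qualities[-window:]]
    return sum(recent) / len(recent) < threshold

# Qualities read along the best path, most recent last (illustrative values).
fading = [100, 90, 10, 5, 0]
steady = [100, 90, 80]
```

Because the mean is taken over several states, the test fires a few lines after the true end of the scratch, which is the overshoot mentioned above.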



4.5 General Algorithm 

for each observation Z_t
    for each candidate in Z_t with non-zero quality value Q(x,y)
        initialise a new track
    end for

    for each track
        for each path hypothesis (from the L-list at t-1) related to this track
            estimate model parameter using N states along this path
            extend model for time t

            for each new state X_t reachable from state X_t-1
                compute cost for the transition X_t-1 to X_t
                add new transition cost to overall path cost
            end for
        end for

        sort and keep L-best paths, clear other paths from memory
        if end condition for the best path is met
            store track and clear memory
        end if
    end for
end for
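The loop above can be turned into a compact runnable sketch. The Python code below is an illustration under several simplifying assumptions that are not from the paper: the model prediction is a linear extrapolation from the last two states (standing in for the second-order polynomial fit), the L-best list is kept per track rather than per state, existing tracks are extended before new ones are created so that matched candidates do not spawn duplicates, and all parameter values are arbitrary.

```python
def predict(path):
    """Linear extrapolation from the last two states (stand-in for the quadratic model)."""
    if len(path) < 2:
        return path[-1]
    return 2 * path[-1] - path[-2]

def track_scratches(Q, L=4, W=1.0, min_quality=64, end_threshold=16, window=3):
    """Return a list of (start_time, path) tuples; path[k] is the x position at start_time + k."""
    finished, tracks = [], []
    for t, row in enumerate(Q):
        claimed, alive = set(), []
        for track in tracks:
            extended = []
            for energy, path in track["paths"]:
                predicted = predict(path)
                for x in (path[-1] - 1, path[-1], path[-1] + 1):
                    if not 0 <= x < len(row):
                        continue
                    # The sign of Q prevents mixing bright and dark candidates.
                    q = row[x] if row[x] * track["sign"] > 0 else 0
                    branch = abs(q) - W * (x - predicted) ** 2
                    extended.append((energy + branch, path + [x]))
            extended.sort(key=lambda item: item[0], reverse=True)
            track["paths"] = extended[:L]            # keep the L best hypotheses
            best = track["paths"][0][1]
            claimed.add(best[-1])
            recent = [abs(Q[track["start"] + k][x]) for k, x in enumerate(best)][-window:]
            if sum(recent) / len(recent) < end_threshold:
                finished.append((track["start"], best))   # ending condition met
            else:
                alive.append(track)
        tracks = alive
        for x, q in enumerate(row):                  # unmatched, relevant candidates
            if abs(q) >= min_quality and all(abs(x - c) > 1 for c in claimed):
                tracks.append({"start": t, "sign": 1 if q > 0 else -1,
                               "paths": [(abs(q), [x])]})
                claimed.add(x)
    finished.extend((track["start"], track["paths"][0][1]) for track in tracks)
    return finished

# A bright scratch drifting from x = 2 to x = 4 over six projected lines.
Q_demo = [[127 if x == pos else 0 for x in range(7)] for pos in (2, 2, 3, 3, 4, 4)]
tracks_found = track_scratches(Q_demo)
```

On this noise-free input a single track is reported, following the drifting scratch across all six lines.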



5 Evaluation and Conclusion 

This contribution is focused on the detection side of the scratch removal
process. The described detection and tracking scheme feeds the subsequent
correction process with scratch trajectories; this latter process could rely on several
approaches: interpolation, in-painting, etc. Showing a result for the complete
removal process on the basis of a single image is irrelevant in printed form, since
the restoration quality can only be assessed by dynamic rendering 2 . Generally,
we do not have film samples before degradation, and working with synthetic data
2 See video clips attached to the electronic version of this paper 
Fig. 7. Left : the image Q(x,y) for 27 consecutive frames. This sequence shows many
vertical structures (vertical curtain folds in the background of the scene) beside a real 
scratch, middle : Tracking results. Right : Tracking result keeping the longest paths 



is nonsense (simulated scratches: what model to use?), so an objective efficiency
measurement is difficult. And because the goal of restoration is to improve the
visual quality of degraded film sequences, the appropriate evaluation method is
subjective evaluation.

Even if we still find really weird images (e.g. with a lot of vertical
structures, similar to figure 7) that defeat our algorithm, the overall efficiency
of this detection scheme has been demonstrated, and it performs better than the
previous ones or other known methods, for jittering scratches as well as steady ones.
At present, we are improving the whole scratch removal process, especially the
correction step, by limiting the repetitive over-corrections.

Finally, this tracking concept is used in our restoration software suite called
RETOUCHE, used by the French post-production group Centrimage and by
the CNC (French national film archives). RETOUCHE has been used for the
digital restoration of 3 full-length features in 2K resolution and one video
documentary (ca. 500000 frames) with convincing results. The algorithm and its
implementation are also fast (less than a second per frame for 2K images).



Acknowledgements. Images from “The Lost World” (1925) by courtesy of
the Cinematheque de Bretagne.
Abstract. The human visual system is able to correctly determine the
color of objects in view irrespective of the illuminant. This ability to 
compute color constant descriptors is known as color constancy. We have 
developed a parallel algorithm for color constancy. This algorithm is 
based on the computation of local space average color using a grid of 
processing elements. We have one processing element per image pixel. 
Each processing element has access to the data stored in neighboring 
elements. Local space average color is used to shift the color of the input 
pixel in the direction of the gray vector. The computations are executed 
inside the unit color cube. The color of the input pixel as well as local 
space average color is simply a vector inside this Euclidean space. We 
compute the component of local space average color which is orthogonal 
to the gray vector. This component is subtracted from the color of the 
input pixel to compute a color corrected image. Before performing the 
color correction step we can also normalize both colors. In this case, the 
resulting color is rescaled to the original intensity of the input color such 
that the image brightness remains unchanged. 



1 Motivation 

The human visual system is able to correctly determine the color of objects 
irrespective of the light which illuminates the objects. For instance, if we are 
in a room illuminated with yellow lights, we are nevertheless able to determine 
the correct color of the objects inside the room. If the room has a white wall, 
it will reflect more red and green light compared to the light reflected in the 
blue spectrum. Still, we are able to determine that the color of the wall is white. 
However, if we take a photograph of the wall, it will look yellow. This occurs 
because a camera measures the light reflected by the object. The light reflected 
by the object can be approximated as being proportional to the amount of light 
illuminating the object and the reflectance of the object for any given wavelength. 
The reflectance of the object specifies the percentage of the incident light which 
is reflected by the object’s surface. The human visual system is somehow able to 
discount the illuminant and to calculate color constant descriptors if the scene 
is sufficiently complex. This ability is called color constancy. In other words, the 
human visual system is able to estimate the actual reflectance, i.e. the color of 
the object. The perceived color stays constant irrespective of the illuminant used 
to illuminate the scene. 

Land, a pioneer in color constancy research, has developed the retinex the- 
ory [1,2]. Others have added to this research and have proposed variants of the 
retinex theory [3, 4, 5, 6, 7, 8]. Algorithms for color constancy include gamut-con- 
straint methods [9,10,11], perspective color constancy [12], color by correlation 
[13,14], the gray world assumption [15,16], recovery of basis function coefficients 
[17,18,19], mechanisms of light adaptation coupled with eye movements [20], neu- 
ral networks [21,22,23,24,25,26], minimization of an energy function [27], com- 
prehensive color normalization [28], committee-based methods which combine 
the output of several different color constancy algorithms [29] or use of genetic 
programming [30]. Risson [31] describes a method to determine the illuminant by 
image segmentation and filtering of regions which do not agree with an assumed 
color model. 

We have developed a parallel algorithm for color constancy [32,33,34]. The 
algorithm computes local space average color using a parallel grid of processing 
elements. Note that local space average color is not the same as global space 
average color. Global space average color assumes a single illuminant. In contrast, 
we do not assume a uniform illumination of the scene. Local space average color 
is taken as an estimate of the illuminant for each image pixel. This estimate 
of the illuminant is then used to perform a local color correction step for each 
image pixel. 



Fig. 1. RGB color space. The space is defined by the three vectors r (red), g (green), 
and b (blue). The gray vector w passes through the center of the cube from black to 
white. 



2 RGB Color Space 

Let us visualize the space of possible colors as a color cube [35,36] and let 
$\mathbf{r} = [1,0,0]^T$, $\mathbf{g} = [0,1,0]^T$, $\mathbf{b} = [0,0,1]^T$ be the three color vectors red, green and 
blue, which define the cube. The color components are normalized to the range 
[0, 1]. Therefore, all colors are located inside the unit cube. The eight corners of 
the cube can be labeled with the colors black, red, green, blue, magenta, cyan, 
yellow, and white. The gray vector passes through the cube from $[0,0,0]^T$ to 
$[1,1,1]^T$. This RGB color space is shown in Figure 1. 




Fig. 2. Each processing element has access to information stored at neighboring pro- 
cessing elements (left). A matrix of processing elements with one processing element 
per pixel is used (right). 



3 Parallel Computation of Local Space Average Color 

The algorithm operates on a grid of processing elements. Each processing element 
has access to the color of a single image pixel. It also has access to data stored 
and computed by four neighboring processing elements (Figure 2). We have one 
processing element per image pixel. The algorithm first determines local space 
average color. Local space average color is calculated iteratively by averaging 
estimates of the local space average color from neighboring elements. Let $a_i(x,y)$ 
with $i \in \{r, g, b\}$ be the current estimate of the local space average color of 
channel $i$ at position $(x,y)$ in the image. Let $c_i(x,y)$ be the intensity of channel 
$i$ at position $(x,y)$ in the image. Let $p$ be a small percentage greater than zero. 
We iterate the following two steps indefinitely. 

1.) $a'_i(x,y) = (a_i(x-1,y) + a_i(x+1,y) + a_i(x,y-1) + a_i(x,y+1))/4.0$ 

2.) $a_i(x,y) = c_i(x,y) \cdot p + a'_i(x,y) \cdot (1-p)$ 

The first step averages the estimates obtained from the neighboring elements on 
the left and right, as well as above and below. The second step slowly fades the 
color of the current pixel $c_i(x,y)$ into the current average. As a result, we obtain 
local space average color for each pixel of the image. Note that the initialization 
of $a_i$ can be arbitrary as it decays over time due to the multiplication by the 
factor $(1-p)$. Figure 3 shows how local space average color is computed with 
this method for an input image. Since the intensities are averaged for each color 
channel, it is important that the input data is linear. If necessary, it must be 
linearized by applying a gamma correction. 
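
The two update steps can be sketched in plain Python as follows. This is only a sequential illustration, not the parallel grid of processing elements described in the text. The function name `local_space_average`, the zero initialization, the fixed iteration count, the default p = 0.05 (far larger than the values used in the figures, chosen so that tiny test images settle quickly), and the border handling (averaging only over the neighbors that exist) are all assumptions of this sketch.

```python
def local_space_average(image, p=0.05, iterations=500):
    """Estimate local space average color on a grid of (r, g, b) tuples.

    Sketch assumptions: zero initialization, a fixed iteration count, and
    border elements that average only over the neighbors that exist. The
    image needs at least two pixels so that every element has a neighbor.
    """
    height, width = len(image), len(image[0])
    a = [[(0.0, 0.0, 0.0)] * width for _ in range(height)]
    for _ in range(iterations):
        new_a = []
        for y in range(height):
            row = []
            for x in range(width):
                neighbors = [a[ny][nx]
                             for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
                             if 0 <= nx < width and 0 <= ny < height]
                c = image[y][x]
                estimate = []
                for i in range(3):
                    # Step 1: average the estimates of the neighboring elements.
                    averaged = sum(n[i] for n in neighbors) / len(neighbors)
                    # Step 2: fade the pixel color into the average with weight p.
                    estimate.append(c[i] * p + averaged * (1.0 - p))
                row.append(tuple(estimate))
            new_a.append(row)
        a = new_a
    return a
```

On a uniformly colored image the estimate settles at that color, and on an image with a reddish left half and a bluish right half the red estimate is higher on the left than on the right, which is the locality the text relies on.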




Fig. 3. Local space average color is computed iteratively. The images show local space 
average color after 1, 50, 200, 1000, 5000, and 20000 steps. For this image 22893 steps 
were needed until convergence. 



As the estimate of local space average color is handed from one element to 
the next, it is multiplied with the factor $(1-p)$. Therefore, a pixel located $n$ steps 
from the current pixel will only contribute with a factor of $(1-p)^n$ to the new 
estimate of local space average color. The above computation is equivalent to 
the convolution of the input image with the function $e^{-|r|/\sigma}$, where $r = \sqrt{x^2+y^2}$ 
is the distance from the current pixel and $\sigma$ is a scaling factor. In this case, local 
space average color $\mathbf{a}$ is given by 

$$\mathbf{a} = k \int_{x,y} \mathbf{c}\, e^{-|r|/\sigma} \, dx\, dy \tag{1}$$

where $k$ is chosen such that 

$$k \int_{x,y} e^{-|r|/\sigma} \, dx\, dy = 1. \tag{2}$$

Thus, the factor $p$ essentially determines the radius over which local space average 
color is computed. Figure 4 shows the result for different values of $p$. In 
practice, this type of computation can be performed using a resistive grid [7,25]. 




Fig. 4. The parameter p determines the extent over which local space average color 
will be computed (panels: p=0.001, p=0.0005, p=0.0001). If p is large, then local space 
average color will be computed for a small area. If p is small, then local space average 
color will be computed for a large area. 



Alternatively, instead of using a convolution with $e^{-|r|/\sigma}$ to compute local 
space average color, one can also compute space average color by convolving the 
input image with a Gaussian. In this case, local space average color is given by 

$$\mathbf{a} = k \int_{x,y} \mathbf{c}\, e^{-r^2/\sigma^2} \, dx\, dy \tag{3}$$

where $k$ is chosen such that 

$$k \int_{x,y} e^{-r^2/\sigma^2} \, dx\, dy = 1. \tag{4}$$

This type of convolution is used by Rahman et al. [8] to perform color correction 
using several Gaussians of varying extent for each image pixel. 
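
A discrete counterpart of the two convolution kernels can be sketched as below. The kernel shapes exp(-r/sigma) and exp(-r^2/sigma^2), the truncation to a square window of a given radius, and the helper names `normalized_kernel` and `average_at` are assumptions made for illustration. The scale factor plays the role of k, chosen so that the weights sum to one.

```python
import math

def normalized_kernel(radius, sigma, gaussian=False):
    """Sample exp(-r/sigma) (or exp(-r^2/sigma^2)) on a square grid and scale it to sum to 1."""
    weights = []
    for y in range(-radius, radius + 1):
        row = []
        for x in range(-radius, radius + 1):
            r = math.hypot(x, y)  # distance from the current pixel
            exponent = (r * r) / (sigma * sigma) if gaussian else r / sigma
            row.append(math.exp(-exponent))
        weights.append(row)
    k = 1.0 / sum(sum(row) for row in weights)  # discrete analogue of the factor k
    return [[k * w for w in row] for row in weights]

def average_at(channel, x0, y0, kernel):
    """Weighted average of one channel around (x0, y0); samples outside the image are skipped."""
    radius = len(kernel) // 2
    total = weight_sum = 0.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = x0 + dx, y0 + dy
            if 0 <= y < len(channel) and 0 <= x < len(channel[0]):
                w = kernel[dy + radius][dx + radius]
                total += w * channel[y][x]
                weight_sum += w
    return total / weight_sum  # renormalize over the samples actually used
```

Renormalizing at the image border is one possible design choice; it keeps a constant image constant, which the iterative scheme also does.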

4 Color Constancy Using Color Shifts 

According to the gray world hypothesis, on average, the color of the world is gray 
[15,16]. If space average color deviates from gray, it has to be corrected. Instead 
of rescaling the color channels, we use local space average color to shift the color 
vector in the direction of the gray vector. The distance between space average 
color and the gray vector determines how much local space average color deviates 
from the assumption that, on average, the world is gray. Let $\mathbf{w} = \frac{1}{\sqrt{3}}[1,1,1]^T$ be 
the normalized gray vector. Let $\mathbf{c} = [c_r, c_g, c_b]^T$ be the color of the current pixel 
and $\mathbf{a} = [a_r, a_g, a_b]^T$ be local space average color. We first project local space 
average color onto the gray vector. The result is subtracted from local space 
average color. This gives us a vector which is orthogonal to the gray vector. It 
points from the gray vector to the local space average color. Its length is equal 
to the distance between local space average color and the gray vector. 

$$\mathbf{a}_\perp = \mathbf{a} - (\mathbf{a} \cdot \mathbf{w})\,\mathbf{w} \tag{5}$$

This vector is subtracted from the color of the current pixel in order to undo the 
color change due to the illuminant. The output color is calculated as 

$$\mathbf{o} = \mathbf{c} - \mathbf{a}_\perp \tag{6}$$

or 

$$o_i = c_i - a_i + \frac{1}{3}(a_r + a_g + a_b) \tag{7}$$

where $i \in \{r, g, b\}$. If we define $\bar{a} = \frac{1}{3}(a_r + a_g + a_b)$, we have 

$$o_i = c_i - a_i + \bar{a}. \tag{8}$$

This operation is visualized in Figure 5 for two vectors $\mathbf{c}$ and $\mathbf{a}$. 
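
The per-pixel color shift is short enough to write out. The sketch below uses the hypothetical function name `shift_color`; the optional clipping to the unit cube is an assumption of this sketch (the text only mentions clipping later, in the discussion).

```python
def shift_color(c, a, clip=True):
    """Subtract the part of a that is orthogonal to the gray vector from c."""
    mean_a = sum(a) / 3.0  # (a . w) w has the component mean_a in every channel
    o = [ci - ai + mean_a for ci, ai in zip(c, a)]
    if clip:
        o = [min(1.0, max(0.0, oi)) for oi in o]  # stay inside the unit color cube
    return tuple(o)
```

If the local average is already gray, the pixel color is returned unchanged. If the pixel color equals the local average, the output is the gray value with the same mean.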




Fig. 5. First, the vector $\mathbf{a}$ is projected onto the white vector $\mathbf{w}$. The projection is 
subtracted from $\mathbf{a}$ which gives us $\mathbf{a}_\perp$, the component perpendicular to $\mathbf{w}$. This vector 
is subtracted from $\mathbf{c}$ to obtain a color corrected image. 



The entire algorithm can be realized easily in hardware. The averaging operation 
can be realized using a resistive grid [7,25]. See Koosh [37] for a realization 
of analog computations in VLSI. We only require local connections. Therefore, 
the algorithm is scalable to arbitrary image sizes. 

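Instead of subtracting the component of space average color which is orthogonal 
to the white vector, one can also normalize both colors first 

$$\hat{\mathbf{a}} = \frac{1}{a_r + a_g + a_b}[a_r, a_g, a_b]^T \tag{9}$$

$$\hat{\mathbf{c}} = \frac{1}{c_r + c_g + c_b}[c_r, c_g, c_b]^T \tag{10}$$

In this case, both space average color $\mathbf{a}$ and the color of the current pixel $\mathbf{c}$ are 
projected onto the HSI plane $r + g + b = 1$ [35,38]. We again calculate the 
component of $\hat{\mathbf{a}}$ which is orthogonal to the vector $\mathbf{w}$. 

$$\hat{\mathbf{a}}_\perp = \hat{\mathbf{a}} - (\hat{\mathbf{a}} \cdot \mathbf{w})\,\mathbf{w} \tag{11}$$

This component is subtracted from the normalized color vector $\hat{\mathbf{c}}$ 

$$\hat{\mathbf{o}} = \hat{\mathbf{c}} - \hat{\mathbf{a}}_\perp \tag{12}$$

or 

$$\hat{o}_i = \hat{c}_i - \hat{a}_i + \frac{1}{3}. \tag{13}$$

The normalized output is then scaled using the illuminance component of the 
original pixel color. 

$$o_i = (c_r + c_g + c_b)\,\hat{o}_i \tag{14}$$

$$= c_i - (c_r + c_g + c_b)\left(\hat{a}_i - \frac{1}{3}\right) \tag{15}$$

$$= c_i - \frac{c_r + c_g + c_b}{a_r + a_g + a_b}\left(a_i - \frac{1}{3}(a_r + a_g + a_b)\right) \tag{16}$$

$$= c_i - \frac{\bar{c}}{\bar{a}}\,(a_i - \bar{a}) \tag{17}$$

where $\bar{c} = \frac{1}{3}(c_r + c_g + c_b)$ is defined analogously to $\bar{a}$. The calculations needed 
for this algorithm are shown in Figure 6. 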
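
The normalized variant can be sketched in the same way. The function name `shift_color_normalized` and the handling of zero sums (returning the input color unchanged) are assumptions of this sketch; the text does not say what to do when a color sums to zero. No clipping is applied here.

```python
def shift_color_normalized(c, a):
    """Normalize c and a to the plane r + g + b = 1, shift, then restore the intensity of c."""
    sum_c, sum_a = sum(c), sum(a)
    if sum_c == 0.0 or sum_a == 0.0:
        return tuple(c)  # assumption: leave black pixels or undefined averages untouched
    # Normalized output: c_hat_i - a_hat_i + 1/3
    o_hat = [ci / sum_c - ai / sum_a + 1.0 / 3.0 for ci, ai in zip(c, a)]
    # Scale back to the original intensity of the input pixel.
    return tuple(sum_c * oh for oh in o_hat)
```

Because the normalized output components sum to one, the output has the same channel sum as the input pixel, so the image brightness is preserved as stated in the abstract. The result also agrees with the closed form $c_i - (\bar{c}/\bar{a})(a_i - \bar{a})$.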





5 Results 

Both algorithms were tested on a series of images where different lighting conditions 
were used. Results for both algorithms are shown for an office scene in 
Figure 7. The images in the top row show the input images. The images in the 
second row show local space average color which was computed for each input 
image. The images in the third row show the output images of the first algorithm. 
In this case, the component of local space average color which is orthogonal to 
the gray vector was subtracted from the color of the input pixel. The images 
in the last row show the output images of the second algorithm. In this case, 
local space average color and the color of the input pixel were normalized before 
performing the color correction step. 

Fig. 6. First, the vectors $\mathbf{c}$ and $\mathbf{a}$ are projected onto the plane $r + g + b = 1$. The 
corresponding points are $\hat{\mathbf{c}}$ and $\hat{\mathbf{a}}$ respectively. Next, the vector $\hat{\mathbf{a}}$ is projected onto the 
white vector $\mathbf{w}$. The projection is subtracted from $\hat{\mathbf{a}}$ which gives us $\hat{\mathbf{a}}_\perp$, the component 
perpendicular to $\mathbf{w}$. This vector is subtracted from $\hat{\mathbf{c}}$ and finally scaled back to the 
original intensity to obtain a color corrected image. 

The input images were taken with a standard SLR camera. A CD-ROM with 
the images was produced when the film was developed. All three images show 
a desk with some utensils. The first image is very blue due to blue curtains 
which were closed at the time the image was taken. The sun was shining through 
the curtains which produced the blue background illumination. For the second 
input image a yellow light bulb was used to illuminate the room. Finally, the 
desk lamp was switched on for the third image. Again, the blue background 
illumination is caused by sunlight shining through the blue curtains. Note that 
the output images are much closer to what a human observer would expect. 

6 Discussion 

In the following we discuss differences to other color constancy algorithms which 
are related to the algorithm described in this contribution. The algorithms of 
Horn [6,7], Land [2], Moore et al. [25] and Rahman et al. [8] do not accurately 
reproduce human color perception. Helson [39] performed an extensive study 
with human subjects. The subjects' task was to name the perceived color of gray 
stimuli which were illuminated with colored light. The stimuli were placed on a 
gray background. Helson drew the following conclusions from his experiments. 
If the stimulus has a higher reflectance than the background, then the stimulus 
seems to have the color of the illuminant. If the stimulus has the same reflectance 
as the background, then the stimulus is achromatic. If the stimulus has a lower 
reflectance than the background, then the subject perceives the stimulus as having 
the complementary color of the illuminant. The algorithms of Horn [6,7], Land 
[2], Moore et al. [25], and Rahman et al. [8] do not show this behavior. If the 
ratio between the color of the pixel and local space average color is computed, 
then the color of the illuminant falls out of the equation and the stimulus will 
always appear to be achromatic. 

Fig. 7. Experimental results for an office scene. The first row of images shows the input 
images. The second row of images shows local space average color. The third row shows 
the output images which were computed by subtracting the component of local space 
average color which is orthogonal to the gray vector from the color of the input pixel. 
The last row shows the results when local space average color and the color of the input 
pixel were normalized before performing the color correction step. 

All of the above methods require a normalization step which brings the out- 
put to the range [0,1]. This normalization step can either be performed inde- 
pendently for each color band or the normalization can be performed uniformly 
across all color bands. In any case, one needs to loop over all pixels of the image. 
The algorithm which is described in this contribution does not require such a 
normalization step. All of the computations are performed inside the color cube. 
Values outside this color cube are clipped to the border of the cube. 

In our previous work [32,34] we already discussed in depth the computation 
of local space average color using a grid of processing elements. Previously, we 
divided the color of the input pixel by the local space average color. This is 
exactly the gray world assumption [15,16] applied locally. This algorithm does 
not show the behavior described by Helson [39]. In [33] we subtract local space 
average color from the color of the input pixel followed by a rescaling operation. 
This method also does not correspond to human color perception described by 
Helson. 

The algorithms described in this contribution perform color correction inside 
the unit color cube. The color cube is viewed as a Euclidean space. The color of 
the pixel is shifted in a direction perpendicular to the gray vector. The extent of 
the color shift is computed using local space average color. The first of the two 
algorithms which are described in this contribution shows the same response as 
a human observer for stimuli similar to those used in Helson's experiments. 
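
The behavior reported by Helson can be illustrated for the first algorithm with a small idealized calculation. The sketch assumes that the measured color is reflectance times illuminant in each channel and that local space average color equals the color of the gray background under the same illuminant. Both are simplifying assumptions, and the function names are hypothetical.

```python
def shift_color(c, a):
    """Unclipped color shift: o_i = c_i - a_i + mean(a)."""
    mean_a = sum(a) / 3.0
    return tuple(ci - ai + mean_a for ci, ai in zip(c, a))

def chroma_of_gray_stimulus(stimulus_reflectance, background_reflectance, illuminant):
    """Deviation from gray of the corrected color of a gray patch on a gray background."""
    c = tuple(stimulus_reflectance * l for l in illuminant)    # measured stimulus color
    a = tuple(background_reflectance * l for l in illuminant)  # assumed local average
    o = shift_color(c, a)
    mean_o = sum(o) / 3.0
    return tuple(oi - mean_o for oi in o)  # zero in every channel means achromatic
```

With a yellowish illuminant such as (1.0, 0.8, 0.6) and a background reflectance of 0.5, a stimulus of reflectance 0.8 keeps a red excess and a blue deficit (the hue of the illuminant), a stimulus of reflectance 0.5 comes out achromatic, and a stimulus of reflectance 0.2 shows the opposite sign pattern (the complementary hue).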



7 Conclusion 

In light of the current transition away from analog cameras towards digital cam- 
eras it is now possible to post-process the digital images before development 
to achieve accurate reproduction of the scene viewed. Such post-processing can 
either be done by the CPU of the camera or by post-processing the images on 
external hardware before the images are printed. Accurate color reproduction 
is very important for automatic object recognition. However, one of the largest 
markets will probably be consumer photography. 

We have developed an algorithm for color constancy. The method consists of 
two parts: (a) a parallel grid of processing elements which is used to compute 
local space average color and (b) a method to estimate the original colors of the 
viewed objects. Instead of rescaling the red, green, and blue intensities using the 
inverse of local space average color, we shift the color vector into the direction of 
the gray vector. The color shift is based on local space average color. Therefore, 
the algorithm can also be used in the presence of varying illumination. 



Abstract. In this paper, gradient vector flow fields are introduced into 
image anisotropic diffusion, and the shock filter, mean curvature flow and 
Perona-Malik equation are each reformulated in the context of these flow 
fields. Several advantages over the original models are obtained, such as 
numerical stability, a large capture range, and computational simplification. 
In addition, a fairing process is introduced into the anisotropic diffusion, which 
contains a fourth order derivative and is reformulated as the intrinsic 
Laplacian of curvature under the level set framework. With this fairing process, 
the boundaries of shapes become more pronounced. In order to overcome 
numerical errors, the intrinsic Laplacian of curvature is computed from the 
gradient vector flow fields rather than directly from the observed images. 



1 Introduction 

Image anisotropic diffusion smooths the image in the direction of an edge, 
but not perpendicular to it, so that the location and strength of edges can be 
maintained. Since the Perona-Malik equation was presented as an anisotropic diffusion 
model in [1], an extensive literature has presented many 
anisotropic diffusion models and offered numerical schemes to obtain steady state 
solutions [2-10]. In this paper, we focus on the following three classic 
anisotropic diffusion models: the shock filter, the mean curvature flow scheme, and 
the Perona-Malik equation. 

The shock filter scheme was presented in [12] as a stable deblurring algorithm 
approximating deconvolution. It is well known that it is extremely sensitive to noise. 
Most further research focuses on how to define a more precise and robust 
coefficient function so as to smooth noise while preserving shape and geometric 
features. A frequent idea is to add some kind of anisotropic diffusion term with a 
weight between the shock and the diffusion processes. In [2], a combined form that 
couples shock with a diffusion term was proposed, $I_t = -\mathrm{sign}(G_\sigma * I_{\eta\eta})|\nabla I| + c I_{\xi\xi}$, 
where $c$ is a positive scale, $\eta$ is the direction of the gradient and $\xi$ is the direction 
perpendicular to the gradient. In [4], a complex diffusion model was presented, 
$I_t = -(2/\pi)\arctan(a\,\mathrm{Im}(I/\theta))|\nabla I| + \alpha_1 I_{\eta\eta} + \alpha_2 I_{\xi\xi}$, where the first term is the shock term, 
$a$ is a parameter that controls the sharpness, $\alpha_1 = r e^{i\theta}$ is a complex scale, and $\alpha_2$ is a real 
scale. 

T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3023, pp. 288-301, 2004. 

The mean curvature flow model was presented in [16,8] as an edge enhancement 
algorithm in the presence of noise. In [10], the mean curvature flow was applied to 
enhancement and denoising under the Min/Max flow scheme. In the above 
applications, only the pure curvature flow was used, whereas in [7] a deconvolution model 
was introduced into the mean curvature flow model for deblurring and denoising. 

For the Perona-Malik equation, a frequent idea is to design the coefficient 
functions directly so as to weight the terms $I_{\eta\eta}$ and $I_{\xi\xi}$ adaptively. In [5], the 
coefficient function was defined as $g(x) = (1 + (x/k_1)^n)^{-1} - \alpha\,(1 + ((x - k_2)/w)^{2m})^{-1}$, 
where $k_1$ is a threshold that limits the gradients to be smoothed out, and $k_2$ and $w$ are the 
threshold and range that control the inverse diffusion process. In [6], the general form 
of the Perona-Malik equation was presented, $I_t = c\,[a I_{\eta\eta} + b I_{\xi\xi}]$, where the parameters 
$a$, $b$, $c$ are defined respectively. The two eigenvalues of the Hessian matrix are 
used instead of the second order derivatives $I_{\eta\eta}$ and $I_{\xi\xi}$ in order to suppress the influence of 
noise. From a numerical point of view, these algorithms are too complicated, and 
too many parameters need to be determined. 

In this paper, Gradient Vector Flow fields [11] are introduced into anisotropic 
diffusion. Since these flow fields can be determined in advance (i.e. they are invariant 
during image diffusion) and, in addition, provide a large capture range to the object 
boundaries, they perform well in the presence of noise or spurious edges. Another particular 
advantage is simplified computation. We demonstrate these advantages by 
applying the gradient vector flow fields to the shock filter, the mean curvature flow 
and the Perona-Malik equation respectively, since many proposed anisotropic 
diffusion models regard these three models as their basic prototypes. 
In addition, in order to make the enhanced boundaries vivid, the fourth order flow 
model of the plane curve [13] is introduced into the anisotropic diffusion. Because of 
the gradient vector flow fields, the computation of the fourth order flow becomes 
simple and reliable. 

This paper is organized as follows: in section 2, the gradient vector flow fields are 
briefly introduced. In section 3, the shock filter, mean curvature flow, Perona-Malik 
model and fourth order flow model based on the gradient vector flow fields are 
presented respectively. The implementation details of our presented models and 
experimental results are shown in section 4. Finally, our conclusions and ideas for 
future research appear in section 5. 



2 Gradient Vector Flow (GVF) Fields 

The gradient vector flow (GVF) field was first presented for active contour 
models, in which the GVF field was used as an external force [11]. The 
GVF fields are computed as a diffusion of the intensity gradient vectors, i.e. the GVF 
is estimated directly from the continuous gradient vector space, and its measurement 
is contextual and not equivalent to the distance from the closest point. Thus, 
noise can be suppressed. In addition, the GVF provides a bi-directional force field 
that can capture the object boundaries from either side without any prior knowledge 
about whether to shrink or expand toward the object boundaries. Hence, the GVF 
fields can provide a large capture range to the object boundaries. 

Fig. 1. GVF field corresponding to the rectangle on the right image 

First, a Gaussian edge detector (zero mean, with $\sigma_E$ variance) is used to define the edge 
map [14], 

$$f(\mathbf{x}) = 1 - \frac{1}{\sqrt{2\pi}\,\sigma_E} \exp\left(-\frac{|\nabla (G_\sigma * I)(\mathbf{x})|^2}{2\sigma_E^2}\right), \quad \mathbf{x} \in R^2.$$

In fact, if only 
the boundary information (usually the intensity gradient) is taken into account, the 
edge map $f(\mathbf{x})$ can take other forms as well [11]. The GVF field $\mathbf{v}(\mathbf{x},t)$ is 
defined as the equilibrium solution to the following vector diffusion equation, 

$$\begin{cases} \mathbf{v}_t(\mathbf{x},t) = \mu \nabla^2 \mathbf{v}(\mathbf{x},t) - (\mathbf{v}(\mathbf{x},t) - \nabla f(\mathbf{x}))\,|\nabla f(\mathbf{x})|^2 \\ \mathbf{v}(\mathbf{x},0) = \nabla f(\mathbf{x}) \end{cases}$$
where $\mu$ is a blending parameter. Note that the flow vectors of the resulting 
vector field $\mathbf{v}(\mathbf{x},t)$ always point to the closest object boundaries, i.e. the vector 
$\mathbf{v}(\mathbf{x},t)$ indicates the correct evolving direction of the curvature flow at all times, 
rather than the gradient direction. In Fig. 1, the vectors $\mathbf{v}(\mathbf{x},t)$ in the GVF field always point 
to the closest boundaries of the object. It is clear that a large capture range to the desired 
edges is achieved through a diffusion process of the intensity gradient vectors that 
doesn’t smooth the edges themselves. In the anisotropic diffusion models proposed below, 
the GVF fields are invariant during the image diffusion and can be 
determined in advance. This decreases the influence of noise and improves 
the numerical stability of the image evolution. 
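
How the diffusion extends the capture range can be seen in a one-dimensional analogue of the vector diffusion equation, $v_t = \mu v_{xx} - (v - f_x) f_x^2$ with $v(x,0) = f_x$. The sketch below is an illustration only: the explicit Euler scheme, unit grid spacing, zero-flux borders, the parameter values, and the name `gvf_1d` are all assumptions, and the actual method operates on two-dimensional vector fields.

```python
import math

def gvf_1d(f, mu=0.2, dt=1.0, iterations=2000):
    """Explicit iteration of v_t = mu * v_xx - (v - f_x) * f_x^2 on a 1D edge map f."""
    n = len(f)
    fx = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        fx.append((f[hi] - f[lo]) / (hi - lo))  # central difference, one-sided at the ends
    v = list(fx)  # initial condition v(x, 0) = f_x
    for _ in range(iterations):
        new_v = []
        for i in range(n):
            left = v[max(i - 1, 0)]       # replicated border value (zero flux)
            right = v[min(i + 1, n - 1)]
            laplacian = left - 2.0 * v[i] + right
            new_v.append(v[i] + dt * (mu * laplacian - (v[i] - fx[i]) * fx[i] ** 2))
        v = new_v
    return fx, v
```

For an edge map with a single peak in the middle of the signal, the gradient is practically zero far from the peak, while the diffused field stays clearly positive on the left and negative on the right. In both cases it points toward the edge.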

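In order to reveal its intrinsic properties, let’s expand $\nabla f(|\nabla I|)$, 

$$\mathbf{v}(\mathbf{x},t) = \nabla f(|\nabla I(\mathbf{x})|) = \lambda \nabla |\nabla I(\mathbf{x})|, \quad \text{for } t > 0 \tag{1}$$

where $\lambda = \frac{|\nabla I|}{\sqrt{2\pi}\,\sigma_E^3} \exp\left(-\frac{|\nabla I|^2}{2\sigma_E^2}\right) > 0$. For convenience, the Gaussian operator 
$G_\sigma$ in $f(\mathbf{x})$ is omitted. Deforming Eq.(1) by different means leads to the different geometric 
interpretations given in the following section. 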
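
The scale factor can be checked numerically. The sketch below assumes the edge map has the form $f(s) = 1 - \frac{1}{\sqrt{2\pi}\sigma_E}\exp(-s^2/(2\sigma_E^2))$ with $s = |\nabla I|$, so that $\lambda$ is its derivative with respect to $s$. The function names `edge_map` and `gvf_scale` are hypothetical.

```python
import math

def edge_map(grad_mag, sigma_e):
    """Edge map f as a function of the gradient magnitude (Gaussian smoothing omitted)."""
    return 1.0 - math.exp(-grad_mag ** 2 / (2.0 * sigma_e ** 2)) / (math.sqrt(2.0 * math.pi) * sigma_e)

def gvf_scale(grad_mag, sigma_e):
    """Scale factor lambda, i.e. the derivative of edge_map with respect to grad_mag."""
    return (grad_mag * math.exp(-grad_mag ** 2 / (2.0 * sigma_e ** 2))
            / (math.sqrt(2.0 * math.pi) * sigma_e ** 3))
```

Under these assumptions the factor vanishes for a zero gradient, peaks where the gradient magnitude equals $\sigma_E$, and decays again for very large gradients.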






Image Anisotropic Diffusion Based on Gradient Vector Flow Fields 






3 Diffusion Models Based on GVF Fields 



3.1 Shock Filter 



The heat equation results in a smoothing process, while the inverse heat equation 
leads to a deblurring process that approximates deconvolution. But the inverse heat 
equation is extremely ill-posed. The shock filter tries to get as close as possible to the 
inverse heat equation while reaching a stable solution. It is formulated as 
$I_t = -\mathrm{sign}(I_{\eta\eta})|\nabla I|$, where $\eta$ is the direction of the gradient. 

In this section, the GVF fields are introduced in designing the shock filter. Both 
sides of Eq.(1) are dot-multiplied by the intensity normal vector, $\mathbf{N} = \nabla I/|\nabla I|$, 
as follows, 

$$\mathbf{v} \cdot \mathbf{N} = \lambda \nabla |\nabla I| \cdot \mathbf{N} = \lambda\, \mathbf{N}^T (D^2 I)\, \mathbf{N} \tag{2}$$

where $D^2 I$ denotes the Hessian of the intensity $I$. Obviously, the above expression is equal 
to the second derivative of $I$ in the direction of the intensity gradient, $I_{\eta\eta}$, up to a 
positive scale, $\lambda$. So, the shock filter equation can be reformulated as 

$$I_t = -\mathrm{sign}(\mathbf{v} \cdot \mathbf{N})|\nabla I|. \tag{3}$$

When the GVF and the normal vector have the same direction, the current position is 
not at an edge. On the other hand, when these vectors have opposite directions, the 
current position should be at a boundary. So, Eq.(3) lets the image develop true edges. 
The worst case is when the GVF is tangential to the intensity normal. Obviously, no 
diffusion takes place. From the implementation point of view, Eq.(3) simplifies the 
computation of $I_{\eta\eta}$. However, the original shock filter scheme is 
extremely sensitive to noise because of the lack of diffusion processes (see [4] for 
details). Since the term $(\mathbf{v} \cdot \mathbf{N})$ in Eq.(3) is only a second derivative of $I$ in the 
direction of the intensity gradient and not a diffusion term, it cannot remove noise. 
Thus the noise sensitivity problem still exists in Eq.(3) as in the original scheme. 



3.2 Mean Curvature Flow Equation 

The mean curvature flow equation is one of the anisotropic diffusion models. 
The key idea is that an image is interpreted as a collection of iso-intensity contours 
which can be evolved. Its standard form can be formulated as $I_t = \kappa|\nabla I|$, 
where $\kappa$ is the curvature of the intensity contours, $\kappa = \nabla \cdot (\nabla I/|\nabla I|)$. It has received a lot 
of attention because of its geometrical interpretation: the level sets of the solution 
move in the normal direction with a speed proportional to their mean curvature. Many 
theoretical aspects of this evolution equation, such as the theory of weak solutions 
based upon the so-called viscosity solution theory, have been summarized in [15]. In 
image nonlinear diffusion applications, it has been proved that the curvature flow 
equation is well-posed, and the curvature flow was used as an image selective 
smoothing filter in [16,8]. However, according to Grayson’s theorem, we know 
that all structures would eventually be removed through continued application of 
the curvature flow scheme. In order to preserve the essential structures while removing 
noise, the Max/Min flow framework based on the curvature flow equation was 
proposed in [10]. In the above algorithms, only the pure mean curvature flow is 
used. Indeed, we could introduce some constraint terms in this mean curvature flow 
scheme just as in the active contour models. In this section, our starting point is to 
balance the internal force, which comes from the curvature of the evolving curve, against an 
external force. The GVF fields provide the curvature flow scheme with a new external 
force, which can overcome noise or spurious edges effectively. 

Consider Eq.(2). We know that $(\mathbf{v} \cdot \mathbf{N})$ indicates the second derivative of $I$ in 
the direction of the gradient. Obviously, the sign of $(\mathbf{v} \cdot \mathbf{N})$ changes along the normal 
to the boundaries in the neighborhood of boundaries even if the direction of the gradient 
$\mathbf{N}$ doesn’t change. Thus, the GVF indicates the correct evolution direction of the 
curvature flow, rather than the gradient direction. In our approach, the GVF is introduced 
as a new external force into the original curvature evolution equation directly from a 
force balance condition. According to Eq.(2), we can determine a contextual flow as 
$C_t = (\mathbf{v} \cdot \mathbf{N})\mathbf{N}$, where $C \in R^2$. The interpretation of this flow is clear. An important 
fact is that the propagation driven by the curvature flow always takes place in the 
inward normal direction (i.e. $-\mathbf{N}$). Obviously, the optimal way to reach the boundaries 
is to move along the direction of the GVF. Because of noise or spurious edges, the 
gradient vector can’t always align with the GVF. Thus, the optimal propagation is 
obtained when the unit vector of $\mathbf{v}(\mathbf{x})$ and the inward normal direction are identical. 
On the other hand, the worst case occurs when $\mathbf{v}(\mathbf{x})$ is tangential to the normal, i.e. 
$\mathbf{v} \perp \mathbf{N}$. 

Under the level set framework, it is easy to introduce this contextual flow from the 
GVF fields into the curvature evolution equation, 

$$I_t = r\,\kappa|\nabla I| - (1 - r)\,\mathbf{v} \cdot \nabla I, \tag{4}$$

where $0 < r < 1$. When the GVF and the inward normal have the same direction, 
the flow is accelerated. On the other hand, when these vectors have opposite 
directions, the flow is weakened or even stopped. When the GVF is tangent to the 
normal, the curvature flow $\kappa$ dominates the evolution process. In addition, 
the strength of the new external force, $(\mathbf{v} \cdot \mathbf{N})$, is adjusted adaptively by the 
parameter $\lambda$. For homogeneous regions or boundaries, $\lambda$ becomes so 
small as to weaken the external force from the GVF fields. On the contrary, it 
becomes too large near boundaries to ignore this new external force in Eq.(4). 

The proposed scheme (4) is similar to the model presented in [7], in which a model 
of convolution was introduced into the mean curvature flow scheme. Indeed, 
both deconvolution and deblurring processes are sensitive to noise, while the mean 
curvature term can make them well-posed (see [7] for details). 

From an implementation perspective, the scheme of Eq.(4) has a particular 
advantage over the original curvature evolution equation: since the GVF fields are 
invariant, the flow does not evolve endlessly. When the balance between the internal and 
external forces is reached, the flow evolution terminates at the object 
boundaries. Furthermore, since the GVF fields provide a large capture range to the 
object boundaries, the flow does not get trapped at noise points or spurious edges. Thus, 
the scheme of Eq.(4) is able to suppress noise effectively. 



3.3 Perona-Malik Equation 



The Perona-Malik equation (P-M equation) introduced in [1] has been successful in 
dealing with nonlinear diffusion in a wide range of images. The key idea 
is roughly to smooth out the irrelevant, homogeneous regions like the heat equation 
when $|\nabla I|$ is small, and to enhance the boundaries like an inverse heat equation 
when $|\nabla I|$ is large. The P-M equation in divergence form reads 

$$I_t = \nabla \cdot \big( g(|\nabla I|)\, \nabla I \big), \qquad I(t=0) = I_0,$$

where $g(\cdot) > 0$ is a decreasing function of the gradient magnitude $|\nabla I|$. Usually one lets 

$$g(|\nabla I|) = \exp\left( -\frac{|\nabla I|^2}{2\sigma^2} \right),$$

where $\sigma$ is a positive parameter that controls the level of contrast of boundaries. 
The coefficient function $g(|\nabla I|)$ is close to 1 for $|\nabla I| \ll \sigma$, while it is close 
to 0 for $|\nabla I| \gg \sigma$. A theoretical analysis shows that 
solutions of the P-M equation can actually exhibit inverse diffusion near boundaries, 
and enhance edges whose gradients are greater than $\sigma$. In this section, the GVF fields 
are introduced into the P-M equation. As the GVF is determined in advance, it 
weakens the influence of noise or spurious edges in the inverse diffusion process. 
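The behaviour described above (smoothing where the gradient is small, leaving strong edges almost untouched) can be illustrated with a minimal 1D sketch of the classical P-M equation. The explicit time stepping, the step size `dt`, the zero-flux boundary and the test signal are assumptions made for this illustration only; they are not taken from the paper.

```python
import math

def pm_coefficient(grad_mag, sigma):
    """Classical P-M coefficient g(|grad I|) = exp(-|grad I|^2 / (2 sigma^2))."""
    return math.exp(-(grad_mag ** 2) / (2.0 * sigma ** 2))

def perona_malik_step(signal, sigma, dt=0.2):
    """One explicit step of I_t = d/dx( g(|I_x|) I_x ) on a 1D signal.

    Fluxes are evaluated between neighbouring samples; the end points use a
    zero-flux (reflecting) boundary, which is an assumption of this sketch.
    """
    n = len(signal)
    flux = [0.0] * (n + 1)          # flux[i] sits between samples i-1 and i
    for i in range(1, n):
        d = signal[i] - signal[i - 1]
        flux[i] = pm_coefficient(abs(d), sigma) * d
    return [signal[i] + dt * (flux[i + 1] - flux[i]) for i in range(n)]

def perona_malik(signal, sigma, steps, dt=0.2):
    """Iterate the explicit step a fixed number of times."""
    out = list(signal)
    for _ in range(steps):
        out = perona_malik_step(out, sigma, dt)
    return out

# A step edge of height 10 with a small ripple of amplitude 0.2 on each side.
ripple = [0.2 if i % 2 == 0 else -0.2 for i in range(20)]
noisy = [r + (0.0 if i < 10 else 10.0) for i, r in enumerate(ripple)]
smoothed = perona_malik(noisy, sigma=1.0, steps=200)
```

With $\sigma = 1$ the ripple (neighbouring differences of 0.4) falls in the smoothing regime, while the jump of about 10 gives a coefficient that is numerically zero, so the two sides of the edge diffuse independently.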

Expanding the P-M equation gives $I_t = g'\,|\nabla I|\, I_{\eta\eta} + g\,\Delta I$, where 
$\eta = \nabla I / |\nabla I|$ is the direction of the gradient. In general, the first term is an 
inverse diffusion term for sharpening the boundaries, while the second term is a 
Laplacian term for smoothing the regions that are relatively flat. In order to accord 
with the GVF fields, the coefficient function is defined as 

$$g(|\nabla I|) = 1 - f(|\nabla I|) = 1 - \exp\left( -\frac{|\nabla I|^2}{2\sigma_E^2} \right).$$

Comparing Eq.(1) with the P-M equation, we note that Eq.(1) lacks the gradient vector 
term $\nabla I$. Consider the term $-v \cdot \nabla I$: it plays the role of the inverse 
diffusion term in the P-M equation. Thus, the P-M equation is rewritten as 

$$I_t = -v \cdot \nabla I + g\,\Delta I, \qquad (5)$$



where $g(\cdot)$ can be estimated directly using the gradient $|\nabla I|$. The advantages of Eq.(5) 
over the traditional P-M equation are as follows: 

• As only the gradient and Laplacian terms need to be computed directly from 
the observed images, the computation is simplified. 

• The worst case for the first term occurs when the GVF vector is orthogonal to the 
gradient direction (tangent to the level curve); the inverse diffusion term is then 
equal to 0. In fact, it is noise or spurious edges that make these vectors orthogonal. 
Thus the regions around these points should be smoothed, not enhanced. 

• The scheme of Eq.(5) is an open framework. Under this scheme, the inverse 
diffusion term and the Laplacian term can be easily controlled by redefining 
the coefficients $g(\cdot)$ and $g'(\cdot)$ according to the desired smoothing effect. 

Fig. 2. Comparison of the coefficients $g(x)$, $g'(x)$, $\tilde g(x)$, plotted as functions of the 
gradient magnitude ($\sigma_E = 1$, $a = 2$). 
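To make the structure of Eq.(5) concrete, the following minimal sketch applies one explicit update of $I_t = -v \cdot \nabla I + g\,\Delta I$ restricted to a 1D signal. It assumes the field `v` has already been computed elsewhere (the GVF computation itself is not shown), takes the coefficient as a callable, and uses central differences with fixed end points; all of these are simplifying assumptions of the illustration, and a practical scheme would more likely use upwind differences for the advection-like term.

```python
import math

def gvf_pm_step(signal, v, g, dt=0.1):
    """One explicit step of Eq.(5) restricted to 1D: I_t = -v * I_x + g(|I_x|) * I_xx.

    `v` is a precomputed field (assumed given, one value per sample) and `g`
    is the smoothing coefficient as a function of |I_x|.
    Interior samples use central differences; end samples are held fixed.
    """
    out = list(signal)
    for i in range(1, len(signal) - 1):
        ix = 0.5 * (signal[i + 1] - signal[i - 1])
        ixx = signal[i + 1] - 2.0 * signal[i] + signal[i - 1]
        out[i] = signal[i] + dt * (-v[i] * ix + g(abs(ix)) * ixx)
    return out

def g_example(x, sigma_e=1.0):
    """Coefficient 1 - exp(-x^2 / (2 sigma_E^2)); one reading of the definition above."""
    return 1.0 - math.exp(-(x * x) / (2.0 * sigma_e ** 2))

ramp = [float(i) for i in range(8)]          # I_x = 1 and I_xx = 0 in the interior
zero_field = [0.0] * len(ramp)
unit_field = [1.0] * len(ramp)
unchanged = gvf_pm_step(ramp, zero_field, g_example)
shifted = gvf_pm_step(ramp, unit_field, g_example, dt=0.1)
```

On a linear ramp the Laplacian term vanishes, so the update is driven only by $-v\,I_x$: a zero field leaves the signal unchanged, and a unit field lowers every interior sample by `dt`.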



Consider the coefficient function $g(x)$ and its derivative $g'(x)$. By adjusting the 
parameter $\sigma_E$, we can control the smoothing effect. However, the coefficient of the inverse 
diffusion term cannot be adjusted freely (see Fig.2(a,b)). An intuitive remedy is to 
redefine the coefficient of the inverse diffusion term in a Gaussian form which, up to 
its normalizing prefactor, reads 

$$\tilde g(x) \propto \exp\left( -\frac{(x-a)^2}{2\sigma_E^2} \right) = \exp\left( -\frac{x^2}{2\sigma_E^2} \right) \exp\left( -\frac{a^2 - 2ax}{2\sigma_E^2} \right),$$

where $a$ is an inverse diffusion parameter and the first exponential factor is the one 
that already appears in $g'(x)$. Clearly, the inverse diffusion can be 
controlled freely by changing the parameter $a$ (see Fig. 2(c)). 

The parameter $a$ tunes the inverse diffusion and is independent of 
$\sigma_E$. Unfortunately, the coefficient $\tilde g(x)$ has a numerical drawback. 
Numerically, it is very difficult to make the estimate of $\sigma_E$ in $\tilde g(x)$ equal to the one in 
the coefficient $g'(x)$, since the latter comes from the GVF. The mismatch leaves an 
exponential term $e^{bx}$, $b \in \mathbb{R}$, in $\tilde g(x)$, which leads to very large errors at noise 
points with large gradient magnitude. 

A more robust method is to redefine the coefficient of the inverse diffusion term as 

$$\tilde g(x) = s\, x^m\, g'(x), \qquad m \ge 1,$$

where $s$ is a magnitude scale and $m$ is a parameter which controls the location of the wave 
crest. This function is bimodal, and each wave crest is similar to a 
Gaussian distribution. Since $x = |\nabla I|$ in our case, we only need to consider a single 
wave crest of this function. As the parameter $m$ increases from one to larger integers, 
the center of the coefficient function $\tilde g(x)$ moves from low to high gradient magnitudes. 
This is illustrated in Fig. 3. 
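The effect of $m$ on the crest location can be checked numerically. The sketch below assumes $g(x) = 1 - \exp(-x^2/(2\sigma_E^2))$, so that $g'(x) = (x/\sigma_E^2)\exp(-x^2/(2\sigma_E^2))$; under that assumption, setting the derivative of $x^m g'(x)$ to zero gives a crest at $x = \sigma_E\sqrt{m+1}$. The grid search and its resolution are choices made for this illustration.

```python
import math

def g_prime(x, sigma_e):
    """Derivative of g(x) = 1 - exp(-x^2 / (2 sigma_E^2)) (assumed form of g)."""
    return (x / sigma_e ** 2) * math.exp(-(x * x) / (2.0 * sigma_e ** 2))

def g_tilde(x, m, sigma_e=1.0, s=1.0):
    """Power-form inverse diffusion coefficient s * x^m * g'(x)."""
    return s * (x ** m) * g_prime(x, sigma_e)

def crest_location(m, sigma_e=1.0, x_max=10.0, samples=20001):
    """Locate the crest of g_tilde on [0, x_max] by a dense grid search."""
    step = x_max / (samples - 1)
    best = max(range(samples), key=lambda i: g_tilde(i * step, m, sigma_e))
    return best * step

# Compare the numerically located crest with sigma_E * sqrt(m + 1).
crests = {m: crest_location(m) for m in (1, 2, 3, 4)}
predicted = {m: math.sqrt(m + 1) for m in (1, 2, 3, 4)}
```

The crest moves to larger gradient magnitudes as $m$ grows, which matches the statement that larger $m$ shifts the inverse diffusion toward stronger edges.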



Fig. 3. The coefficient $\tilde g(x)$ as the parameter $m$ varies ($\sigma_E = 1$). 




Image Anisotropic Diffusion Based on Gradient Vector Flow Fields 






Since $x^m$ in $\tilde g(x)$ is a power function, its growth is dominated by the exponential 
term $e^{-bx}$, $b \in \mathbb{R}$, in $g'(x)$ at noise points with large gradient magnitude. Thus, 
the extension of Eq.(5) can be re-expressed as 

$$I_t = -s\,|\nabla I|^m\, v \cdot \nabla I + g\,\Delta I, \qquad m \ge 1. \qquad (6)$$

Clearly, the parameter $m$ controls the inverse diffusion while the parameter $\sigma_E$ 
controls the smoothing effect, and these two parameters are independent of each other. 

However, the scheme of Eq.(6) is only one of the various forms of the P-M equation. 
The scheme of Eq.(5) is an open framework, in which there are plenty of other 
choices for designing the coefficient functions $g(x)$, $g'(x)$. 



3.4 Fairing Process and Fourth-Order Flow Model 

The fairing process is derived from CAD/CAM modeling techniques. A basic 
approach is to minimize two energy functionals: $\min \int \kappa^2\, ds$ for curve 
fairness, and $\min \int (\kappa_1^2 + \kappa_2^2)\, dS$ for surface fairness, where $\kappa_1$ and $\kappa_2$ 
are the principal curvatures [17]. These are usually called the least total curvature. 
Recently, they were introduced into the anisotropic diffusion model of curves and 
surfaces in [19,20]. The main idea is to smooth a complicated noisy surface while 
preserving sharp geometric features. Under a variational framework, a fourth order PDE 
can be deduced from the least total curvature. In this section, we focus on the fourth 
order flow model in the plane, which will be introduced into image anisotropic diffusion. 

In [21], the fourth order flow was presented as the intrinsic Laplacian of curvature 
under the level set framework, $\kappa_{ss}$, 
which is the second derivative of the local curvature $\kappa$ with respect to the arc length 
parameter $s$. The particular geometric property of this flow is that it improves the 
isoperimetric ratio, rather than reducing the enclosed area like the mean curvature flow. 
For comparison, the evolutions of a star shape under the intrinsic Laplacian of 
curvature flow, $I_t = \kappa_{ss}|\nabla I|$, and the mean curvature flow, $I_t = \kappa|\nabla I|$, are shown in 
Fig.4. 

It is obvious that the flow under the intrinsic Laplacian of curvature finally converges 
to a circle: the isoperimetric ratio of a circle is maximal for a fixed perimeter, 
and the derivatives of the curvature converge uniformly to zero. 
In image diffusion, this fourth order flow model can preserve the 
boundaries of shapes rather than smoothing them out, while small 
oscillations around the boundaries are smoothed out. 

a. flow under the intrinsic Laplacian of curvature (This example is from [21].) 

b. flow under the mean curvature 

Fig. 4. Evolution of star shape, iteration=15000 in (a), while iteration=1000 in (b) 
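The isoperimetric property can be made tangible by comparing a star shape with a circle. The sketch below uses the normalization $4\pi A / L^2$ (area $A$, perimeter $L$), which equals 1 for a circle; this normalization and the particular star polygon are assumptions of the illustration, since the text does not fix a formula.

```python
import math

def polygon_area_perimeter(points):
    """Shoelace area and perimeter of a closed polygon given as (x, y) pairs."""
    area = 0.0
    perimeter = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        area += x0 * y1 - x1 * y0
        perimeter += math.hypot(x1 - x0, y1 - y0)
    return abs(area) / 2.0, perimeter

def isoperimetric_ratio(points):
    """4*pi*A / L^2: equals 1 for a circle and is smaller for other closed curves."""
    area, perimeter = polygon_area_perimeter(points)
    return 4.0 * math.pi * area / perimeter ** 2

def star(n_tips=5, r_outer=1.0, r_inner=0.4):
    """Star polygon alternating between an outer and an inner radius."""
    pts = []
    for i in range(2 * n_tips):
        r = r_outer if i % 2 == 0 else r_inner
        a = math.pi * i / n_tips
        pts.append((r * math.cos(a), r * math.sin(a)))
    return pts

def circle(n=720, r=1.0):
    """Regular n-gon approximating a circle of radius r."""
    return [(r * math.cos(2 * math.pi * i / n), r * math.sin(2 * math.pi * i / n))
            for i in range(n)]

star_ratio = isoperimetric_ratio(star())
circle_ratio = isoperimetric_ratio(circle())
```

A flow that improves this ratio therefore pushes a star-like contour toward a rounder shape without necessarily shrinking its area.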



However, owing to the fourth order derivative term in the evolution equation, the flow 
is highly sensitive to errors, and the fourth order derivative term leads to 
numerical schemes with very small time steps. In Fig.4(a), the space step is $\Delta x = 0.0667$, 
the time step is $\Delta t = 5 \times 10^{-6}$, and more than 40 reinitializations are used. In fact, it is 
ill-posed to minimize the total squared curvature $\int \kappa^2\, ds$ of closed plane curves 
(i.e. plane curve raveling). Because the total squared curvature is scale-dependent, it 
can be reduced indefinitely: the gradient flow inflates any closed curve without limit. In 
order to make the problem well-posed, the total squared curvature was modified 
to $\int (\kappa^2 + \alpha^2)\, ds$ in [13], and the corresponding gradient flow was deduced under a 
variational framework, 

$$C_t = \big( \kappa_{ss} + \kappa\,(\kappa^2 - \alpha^2)/2 \big) N, \qquad (7)$$

where $C$ is a closed evolving plane curve, $N$ is the normal vector and $\alpha$ is the penalty 
term that makes the problem well-posed. A few important conclusions about Eq.(7) from 
[13] need to be highlighted: 

• The long time existence of a solution to Eq.(7) is proven, and a stationary 
solution can be reached. 

• If the descent flow corresponding to the ‘pure’ energy is considered, 
the normal speed is simply $F = \kappa_{ss} + \kappa^3 / 2$. 

• The flow can smooth not only embedded curves but also immersed 
curves. 

The intrinsic Laplacian of curvature has been introduced in active contours as a 
rigid force for 2D and 3D segmentation in [18]. In fact, it can also be applied to the 
anisotropic diffusion of images. Because of the isoperimetric property of the 
intrinsic Laplacian of curvature term, the boundaries of shapes in the evolving image 
become vivid. We deduce the intrinsic Laplacian of curvature directly from 
the GVF fields to simplify computation and improve numerical stability. 

Consider the GVF form of Eq.(2). We can obtain the second derivative of $I$ in the 
direction orthogonal to the gradient, which can be formulated as 
$I_{\xi\xi} = \lambda_1 \kappa - v \cdot N$. Because $I_{\xi\xi}$ can be written in a “quasi divergence form” [16], 
$I_{\xi\xi} = |\nabla I|\, \nabla \cdot (\nabla I / |\nabla I|)$, we have $\lambda_1 \kappa - v \cdot N = \lambda_2 k$, where $\lambda_2 > 0$ 
depends only on $|\nabla I|$ and $k = \nabla \cdot (\nabla I / |\nabla I|)$, which can be looked upon as a curvature flow. 
In general, the curvature flow evolves along the direction of the gradient. The above 
equation can be regarded as defining a force field along the direction of the gradient, 
$E = \lambda_2 k N$. The derivative of the field $E$ with respect to the arc length follows from 
the Frenet-Serret formulas, 

$$E_s = (\lambda_2 k)_s N - (\lambda_2 k)\, k\, T,$$

where $T$ is a unit tangent vector with $T \cdot N = 0$, and $k$ is the intensity contour curvature, 
which comes only from the observed image. Furthermore, the second derivative of the 
field $E$ with respect to the arc length is 

$$E_{ss} = \big( (\lambda_2 k)_{ss} - (\lambda_2 k) k^2 \big) N - \big( 2 (\lambda_2 k)_s k + (\lambda_2 k) k_s \big) T.$$

For the gradient flow, we have $E_{ss} \cdot N = (\lambda_2 k)_{ss} - (\lambda_2 k) k^2$. Obviously, this is the 
normal speed of the gradient flow corresponding to the ‘pure’ energy $\int \kappa^2\, ds$. Denote 
$K = \lambda_2 k$ for convenience. The second derivative of $K$ can be expressed as 

$$K_{ss} = \frac{K_{xx} I_y^2 - 2 K_{xy} I_x I_y + K_{yy} I_x^2}{I_x^2 + I_y^2} - k\, \frac{K_x I_x + K_y I_y}{\sqrt{I_x^2 + I_y^2}},$$

where $K$ can be estimated from the GVF and the observed image, $N$ is from the 
observed image and $\lambda_1$, $\lambda_2$ can be estimated using the gradient $|\nabla I|$. Hence, the flow 
under the intrinsic Laplacian of curvature is rewritten as 

$$I_t = -(E_{ss} \cdot N)\, |\nabla I| = (K k^2 - K_{ss})\, |\nabla I|.$$

It can be noticed that all the fourth order derivative terms in the above intrinsic 
Laplacian of curvature flow are estimated from the GVF fields, not directly from 
the observed image. This effectively improves the numerical stability. 
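The arc-length second derivative of $K$ along an intensity contour can be checked numerically. The sketch below evaluates the standard expression (tangential Hessian term minus $k$ times the normalized directional derivative along $\nabla I$, with $k = \nabla \cdot (\nabla I/|\nabla I|)$) on a circle and compares it with a finite difference taken along the same circle. The test image $I = \sqrt{x^2 + y^2}$ and the field $K = x^2 y + y$ are arbitrary choices for the illustration, not quantities from the paper.

```python
import math

def k_ss_formula(kx, ky, kxx, kxy, kyy, ix, iy, curvature):
    """Second arc-length derivative of K along a level curve of I:
    (K_xx I_y^2 - 2 K_xy I_x I_y + K_yy I_x^2) / |grad I|^2
      - k (K_x I_x + K_y I_y) / |grad I|,   with k = div(grad I / |grad I|)."""
    g2 = ix * ix + iy * iy
    tangential = (kxx * iy * iy - 2.0 * kxy * ix * iy + kyy * ix * ix) / g2
    normal = curvature * (kx * ix + ky * iy) / math.sqrt(g2)
    return tangential - normal

def K(x, y):
    """Test field K(x, y) = x^2 * y + y (arbitrary smooth choice)."""
    return x * x * y + y

def k_ss_numeric(r, theta, h=1e-3):
    """Central second difference of K along the circle of radius r, in arc length."""
    def along(s):
        return K(r * math.cos(s / r), r * math.sin(s / r))
    s0 = r * theta
    return (along(s0 + h) - 2.0 * along(s0) + along(s0 - h)) / (h * h)

r, theta = 2.0, 0.7
x, y = r * math.cos(theta), r * math.sin(theta)
# I(x, y) = sqrt(x^2 + y^2): level curves are circles, grad I = (x, y)/r, k = 1/r.
analytic = k_ss_formula(kx=2 * x * y, ky=x * x + 1, kxx=2 * y, kxy=2 * x, kyy=0.0,
                        ix=x / r, iy=y / r, curvature=1.0 / r)
numeric = k_ss_numeric(r, theta)
```

Agreement between the two values indicates that only first and second derivatives of $K$ and first derivatives of $I$ are needed to evaluate $K_{ss}$.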

However, the above intrinsic Laplacian of curvature flow equation is ill-posed. 
Compared to the scheme of Eq.(7), it lacks a mean curvature term. In order to 
make it well-posed, we couple the scheme of Eq.(4) with the above equation, which gives 

$$I_t = \beta \big( r \kappa |\nabla I| - (1 - r)\, v \cdot \nabla I \big) + (K k^2 - K_{ss})\, |\nabla I|, \qquad (8)$$

where $\beta$ is a constant that balances the contributions of the mean curvature flow 
and the fourth order flow. 



4 Experiments 

We first illustrate the GVF-based shock filter on a mammographic image in Fig.5(a-c). 
Medical images of this kind are usually very blurry and need to be deblurred; 
the shock filter serves as a stable deconvolution filter. For comparison, the 
result of the original shock filter is shown too. The original shock filter 
over-enhances the details in Fig. 5(b), while the scheme of (3) 
enhances the essential structures and suppresses some trivial details in Fig. 5(c). 
Fig. 5(h) demonstrates that the scheme of Eq.(3) reaches a steady state solution more 
quickly than the original scheme. The evolution error is calculated as 

$$\mathrm{error}(t) = \frac{1}{M \times N} \sum_{i,j=1}^{M,N} \left| I_{i,j}^{(t)} - I_{i,j}^{(0)} \right|,$$

where $M$ and $N$ are the width and height of the image 
respectively (all the following experiments use this error measure to generate the 
error diagrams). 
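A direct implementation of such an evolution error is short. The sketch below assumes the error is the mean absolute difference between the evolved image and the initial image, which is one reading of the formula; images are plain lists of rows, and the sample values are made up for the example.

```python
def evolution_error(current, initial):
    """Mean absolute difference between the evolved image and the initial one.

    Both images are lists of rows with identical shape (height x width).
    """
    height = len(initial)
    width = len(initial[0])
    total = 0.0
    for row_t, row_0 in zip(current, initial):
        for a, b in zip(row_t, row_0):
            total += abs(a - b)
    return total / (width * height)

initial = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
evolved = [[0.5, 1.0, 1.5], [3.0, 4.5, 5.0]]
err = evolution_error(evolved, initial)
```

Plotting this value against the iteration count gives an error diagram: a curve that flattens out indicates that the evolution has reached a steady state.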

The original mean curvature flow and the proposed scheme of (4) are illustrated on 
a noisy and blurry image in Fig.5(d-g). The original image is first 
degraded with Gaussian noise (zero mean, with 0.1 variance), then blurred by a 
Gaussian lowpass filter with 0.25 variance. We adopted $I_t = \kappa|\nabla I|$ in section 3.2 as the 
original mean curvature flow scheme. The features of the lily 
image are enhanced and denoised effectively by the scheme of (4) in Fig. 5(g), while 
all the features are smoothed out by the original scheme in Fig. 5(f). The error 
diagrams in Fig.5(i) demonstrate that the scheme of (4) reaches a steady state 
solution and preserves the essential structures of shapes, while the original curvature 
flow scheme eventually smooths out all the information. 

In the experiments on the P-M equation, we first compared the original scheme 
with the proposed scheme of (5). The original water lily image is degraded with 
Gaussian noise and blurred, as in Fig. 5(e). The diffusion results are shown in 
Fig.6(a,b). Their diffusion effects appear to be close. However, the error diagram in 
Fig. 6(c) demonstrates that the scheme of (5) reaches a steady state solution more 
quickly than the original P-M equation. 

In the subsequent experiments, we also illustrate the scheme of (6) on the 
water lily image, which is blurred and degraded with Gaussian noise as in Fig. 5(e). 
The experimental results are shown in Fig. 7. As the coefficient $m$ 
becomes large, some details with large gradient are enhanced, while others with 
small gradient are gradually eroded. This is because the center of the inverse diffusion 
coefficient in Eq.(6) moves as the coefficient $m$ changes. 

In the experiment on the 4th order flow scheme, we illustrate the scheme of (8) 
on the noisy and blurry water lily image of Fig. 5(e). The result is compared with 
that of the scheme of Eq.(5) in Fig. 8. Because of the intrinsic Laplacian of 
curvature term in Eq.(8), the boundaries of objects become very vivid in Fig. 8(b). This 
clearly indicates that the isoperimetric property of the intrinsic Laplacian of 
curvature term makes the boundaries of shapes enhanced and smooth, but not 
eroded, in image anisotropic diffusion. 

a. original image   b. by the original shock filter   c. by the scheme of (3) 

d. original image   e. noisy and blurry image   f. by original scheme   g. by scheme of (4) 

h. Error diagram of the shock filter (original vs. proposed scheme)   i. Error diagram of mean curvature flow (original vs. proposed scheme) 

Fig. 5. Evolutions of shock filter and mean curvature flow scheme 

a. by original scheme   b. by scheme of (5)   c. Error diagram (original vs. proposed scheme) 

Fig. 6. Evolutions of P-M equation with $\sigma_E = 0.45$, (a) iteration=150, (b) iteration=200 

a. m = 1   b. m = 2   c. m = 3   d. m = 4 

Fig. 7. Evolutions of the scheme of (6) with $\sigma_E = 0.45$, s=1 and iteration=200 

a. Evolution of scheme (5)   b. Evolution of scheme (8) 

Fig. 8. Evolution of schemes (5) and (8) on water lily image at iteration=200 



5 Conclusions 

In this paper, we first introduced the gradient vector flow (GVF) fields into image 
anisotropic diffusion. Some well-known PDE diffusion models were reformulated 
based on the GVF fields, such as the shock filter, the mean curvature flow and the 
P-M diffusion model. The particular advantages of the GVF are that it simplifies 
computation, improves numerical stability, and performs well on noisy images. One of 
the most distinct advantages is that the proposed GVF-based anisotropic diffusion 
models reach a steady state solution more quickly than the original ones. 
Besides that, in order to enhance and smooth the boundaries of objects without eroding 
them, the intrinsic Laplacian of curvature was introduced for the first time into the 
anisotropic diffusion of images. Since this flow contains a fourth order derivative 
term, it is very sensitive to errors. We can obtain a robust estimate of this flow from 
the GVF fields. The experiments indicate that our proposed models are robust and 
practical on account of the GVF fields. 

In future work, we will continue to study the GVF-based P-M diffusion equation. 
How to design the coefficient of the inverse diffusion term remains a key 
problem. In addition, we will try to apply the proposed diffusion models to 3D 
volume data for visualization, since the classification process is critical to direct 
volume rendering and anisotropic diffusion could provide the desired 
classification results. 



Abstract. In this paper, we propose a particle filtering approach for tracking 
applications in image sequences. The system we propose combines a measurement 
equation and a dynamic equation which both depend on the image sequence. 
Taking into account several possible observations, the likelihood is modeled as a 
linear combination of Gaussian laws. Such a model allows inferring an analytic 
expression of the optimal importance function used in the diffusion process of the 
particle filter. It also enables building a relevant approximation of a validation gate. 
We demonstrate the significance of this model for a point tracking application. 



1 Introduction 

When tracking features of any kind from image sequences, several specific problems 
appear. In particular, one has to face difficult and ambiguous situations generated by 
cluttered backgrounds, occlusions, large geometric deformations, illumination changes 
or noisy data. To design trackers robust to outliers and occlusions, a classical way consists 
in resorting to stochastic filtering techniques such as Kalman filter [13,15] or sequential 
Monte Carlo approximation methods (called particle filters) [7,10,11,16]. 

Resorting to stochastic filters consists in modeling the problem by a discrete hidden 
Markov state process $x_{0:n} = \{x_0, x_1, \ldots, x_n\}$ with transition equation $p(x_k | x_{k-1})$. The 
sequence of incomplete measurements of the state is denoted $z_{1:n} = \{z_1, z_2, \ldots, z_n\}$, with 
marginal conditional distribution $p(z_k | x_k)$. Stochastic filters give efficient procedures 
to accurately approximate the posterior probability density $p(x_k | z_{1:k})$. This problem 
may be solved exactly through a Bayesian recursive solution, named the optimal filter 
[10]. In the case of linear Gaussian models, the Kalman filter [1] gives the optimal 
solution since the distribution of interest $p(x_k | z_{1:k})$ is Gaussian. In the nonlinear case, 
an efficient approximation consists in resorting to sequential Monte Carlo techniques 
[4,9]. These methods consist in approximating $p(x_k | z_{1:k})$ in terms of a finite weighted 
sum of Diracs centered on elements of the state space named particles. At each discrete 
instant, the particles are displaced according to a probability density function named 
importance function and the corresponding weights are updated through the likelihood. 

For a given problem, a relevant expression of the importance function is crucial 
to achieving efficient and robust particle filters. As a matter of fact, since this function 
is used for the diffusion of the particle swarm, the particle distribution - or the 
state-space exploration - strongly depends on it. It can be demonstrated that the optimal 
importance function in the sense of a minimal weight variance criterion is the distribution 
$p(x_k | x_{k-1}, z_k)$ [9]. As will be demonstrated in the experimental section, the knowledge 
of this density significantly improves the tracking results for a point tracking 
application. 

However, the expression of $p(x_k | x_{k-1}, z_k)$ is totally unknown in most vision 
applications. In such a context, the importance function is simply fixed to the prediction 
density $p(x_k | x_{k-1})$. This constitutes a crude model which is counterbalanced by a 
systematic resampling step of the particles together with sound models of highly multimodal 
likelihood [7,11,16]. 

In this paper, an opposite choice is proposed. We investigate simpler forms of 
likelihood for which the optimal importance function may be inferred. The considered 
likelihood is a linear combination of Gaussian laws. In addition, such a model 
allows expressing a validation gate in a simple way. A validation gate defines a bounded 
search region where the measurements are looked for at each time instant. 

Besides, it is interesting to focus on features for which no dynamic model can be 
set a priori or even learned. This is the case when considering the most general situation 
without any knowledge of the involved sequence. To tackle this situation, we propose 
to rely on dynamic models directly estimated from the image sequence. 

For point tracking applications, such a choice is all the more interesting since any 
dynamic model of a feature point is very difficult to establish without any a priori 
knowledge of the evolution law of the surrounding object. As a consequence, the system 
we propose for point tracking depends entirely on the image data. It combines (i) a 
state equation which relies on a local polynomial velocity model, estimated from the 
image sequence and (ii) a measurement equation ensuing from a correlation surface 
between a reference frame and the current frame. The association of these two approaches 
allows dealing with trajectories undergoing abrupt changes, occlusions and cluttered 
background situations. 

The proposed method has been applied and validated on different sequences. It has 
been compared to the Shi-Tomasi-Kanade tracker [17] and to a CONDENSATION-like 
algorithm [11]. 



2 Nonlinear Image Sequence Based Filtering 

The classical formulation of filtering systems requires knowing a priori the density 
$p(x_{k+1} | x_k)$, and being able to extract, from the image sequence, information used 
as a measurement of the state. However, in our view, feature tracking from 
image sequences may in some cases require slightly modifying the traditional filtering 
framework. These modifications are motivated by the fact that an a priori state model 
is not always available, especially during the tracking of features whose nature is not 
previously known. A solution to this problem may be devised by estimating the target 
dynamics from the image sequence data [2,3]. In that case, it is important 
to distinguish (i) the observation data which constitute the measurements of the state 
from (ii) the data used to extract such a dynamics model. These two pieces of information 
are of different kinds even if they are both estimated from the image sequence - 
and therefore depend statistically on each other. In this unconventional situation where 
dynamics and measurements are both captured from the sequence, it is possible to build 
a proper filtering framework by considering a conditioning with respect to the image 
sequence data. 



2.1 Image Sequence Based Filtering 

Let us first fix our notations. We note $I_k$ an image obtained at time $k$. $I_{0:n}$ represents the 
finite sequence of random variables $\{I_k, k = 0, \ldots, n\}$. Knowing a realization of $I_{0:k}$, 
our tracking problem is modeled by the following dynamic and measurement equations: 

$$x_k = f_k^{I_{0:k}}(x_{k-1}, w_k^{I_{0:k}}),$$

$$z_k = h_k^{I_{0:k}}(x_k, v_k^{I_{0:k}}).$$

At each time $k$, a realization of $z_k$ is provided by an estimation process based on the image 
sequence $I_{0:k}$. Functions $f_k^{I_{0:k}}$ and $h_k^{I_{0:k}}$ are assumed to be any kind of possibly nonlinear 
functions. These functions may be estimated from $I_{0:k}$. The state noise $w_k^{I_{0:k}}$ and the 
measurement noise $v_k^{I_{0:k}}$ may also depend on $I_{0:k}$, and are not necessarily 
Gaussian. We assume that the associated probability distributions are such that 

$$p(x_k | x_{0:k-1}, z_{1:k-1}, I_{0:n}) = p(x_k | x_{k-1}, I_{0:n}),$$

$$p(z_k | x_{0:k}, z_{1:k-1}, I_{0:n}) = p(z_k | x_k, I_{0:n}).$$

By analogy with the classical filtering formulation, the Markovian assumption, as well 
as the conditional independence of the observations, are maintained conditionally to 
the sequence. A causal hypothesis with respect to the temporal image acquisition is 
added. Such a hypothesis means that the state $x_k$ and the measurement $z_k$ are 
assumed to be independent from $I_{k+1:n}$. The optimal filter’s equations can be applied 
to the proposed model. The expected posterior now reads $p(x_k | z_{1:k}, I_{0:k})$. Supposing 
$p(x_{k-1} | z_{1:k-1}, I_{0:k-1})$ known, the recursive Bayesian optimal solution is: 

$$p(x_k | z_{1:k}, I_{0:k}) = \frac{p(z_k | x_k, I_{0:k}) \int p(x_k | x_{k-1}, I_{0:k})\, p(x_{k-1} | z_{1:k-1}, I_{0:k-1})\, dx_{k-1}}{\int p(z_k | x_k, I_{0:k})\, p(x_k | z_{1:k-1}, I_{0:k})\, dx_k}.$$

To solve this conditional tracking problem, standard filters have to be derived in a 
conditional version. The linear version of this framework, relying on a linear minimal 
conditional variance estimator, is presented in [2,3]. The nonlinear version is implemented 
with a particle filter and is called Conditional NonLinear Filter. 
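On a finite state space the recursive Bayesian solution can be written out exactly, which makes the roles of the prediction integral and the normalizing denominator explicit. The sketch below is an illustration only: it drops the conditioning on the image sequence, and the two-state transition matrix and likelihood values are made up.

```python
def bayes_filter_step(prior, transition, likelihood):
    """One prediction/update step of the optimal filter on a finite state space.

    prior[j]          : p(x_{k-1} = j | z_{1:k-1})
    transition[j][i]  : p(x_k = i | x_{k-1} = j)
    likelihood[i]     : p(z_k | x_k = i) for the observed z_k
    """
    n = len(prior)
    # Prediction: sum over the previous state (the integral in the numerator).
    predicted = [sum(prior[j] * transition[j][i] for j in range(n)) for i in range(n)]
    # Update: multiply by the likelihood, then normalize by the evidence.
    unnormalized = [likelihood[i] * predicted[i] for i in range(n)]
    evidence = sum(unnormalized)          # denominator of the recursion
    return [u / evidence for u in unnormalized]

prior = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
likelihood = [0.7, 0.1]
posterior = bayes_filter_step(prior, transition, likelihood)
```

For continuous, nonlinear models these sums become integrals with no closed form, which is what motivates the particle approximation.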



2.2 Conditional NonLinear Filter 

Facing a system with nonlinear dynamics and/or a nonlinear likelihood, it is no longer possible 
to construct an exact recursive expression of the posterior density function of 
the state given all available past data. To overcome these computational difficulties, 
particle filtering techniques implement recursively an approximation of this 
density (see [4,9] for an extended review). These methods consist in approximating the 
posterior density by a finite weighted sum of Diracs centered on hypothesized trajectories 
- called particles - of the initial system $x_0$: 

$$p(x_{0:k} | z_{1:k}, I_{0:k}) \approx \sum_{i=1}^{N} w_k^{(i)}\, \delta(x_{0:k} - x_{0:k}^{(i)}).$$

At each time instant (or iteration), the set of particles $\{x_{0:k}^{(i)}, i = 1, \ldots, N\}$ is drawn 
from an approximation of the true distribution $p(x_{0:k} | z_{1:k}, I_{0:k})$, called the importance 
function and denoted $\pi(x_{0:k} | z_{1:k}, I_{0:k})$. The closer the approximation is to the true 
distribution, the more efficient the filter. The particle weights $w_k^{(i)}$ account for the 
deviation with regard to the unknown true distribution. The weights are updated according 
to the importance sampling principle: 

$$w_k^{(i)} = \frac{p(z_{1:k} | x_{0:k}^{(i)}, I_{0:k})\, p(x_{0:k}^{(i)} | I_{0:k})}{\pi(x_{0:k}^{(i)} | z_{1:k}, I_{0:k})}.$$

Choosing an importance function that recursively factorizes such as: 

$$\pi(x_{0:k} | z_{1:k}, I_{0:k}) = \pi(x_{0:k-1} | z_{1:k-1}, I_{0:k-1})\, \pi(x_k | x_{0:k-1}, z_{1:k}, I_{0:k})$$

allows recursive evaluations in time of the particle weights as new measurements 
become available. Such an expression naturally implies a causal assumption of the 
importance function w.r.t. observations and image data. The recursive weights then read: 

$$w_k^{(i)} = w_{k-1}^{(i)}\, \frac{p(z_k | x_k^{(i)}, I_{0:k})\, p(x_k^{(i)} | x_{k-1}^{(i)}, I_{0:k})}{\pi(x_k^{(i)} | x_{0:k-1}^{(i)}, z_{1:k}, I_{0:k})}.$$



Unfortunately, such a recursive factorization of the importance function induces an 
increase over time of the weight variance [12]. In practice, this makes the number of 
significant particles decrease dramatically over time. To limit such a degeneracy, two 
methods have been proposed (here presented in the conditional framework). 

A first solution consists in selecting an optimal importance function which minimizes 
the variance of the weights conditioned upon $x_{0:k-1}$, $z_{1:k}$ and $I_{0:k}$ in our case. 
It is then possible to demonstrate that $p(x_k | x_{k-1}, z_k, I_{0:k})$ corresponds to this optimal 
distribution. With this distribution, the recursive formulation of $w_k^{(i)}$ becomes: 

$$w_k^{(i)} = w_{k-1}^{(i)}\, p(z_k | x_{k-1}^{(i)}, I_{0:k}). \qquad (1)$$

The problem with this approach is that it requires being able to sample 
from the optimal importance function $p(x_k | x_{k-1}, z_k, I_{0:k})$, and to have an expression 
of $p(z_k | x_{k-1}, I_{0:k})$. In vision applications, the optimal importance function is usually 
not accessible. The importance function is then set to the prediction density (i.e. 
$\pi(x_k | x_{0:k-1}, z_{1:k}) = p(x_k | x_{k-1})$). Such a choice excludes the measurements from the 
diffusion step. 
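The recursive weight update and the degeneracy it produces can be observed on a toy model. The sketch below uses the prediction density as importance function, so the weight update reduces to multiplying by the likelihood, and it monitors the effective sample size $N_{\mathrm{eff}} = 1/\sum_i (w^{(i)})^2$ as a measure of the number of significant particles. The scalar autoregressive model, the noise variances, the number of particles and the use of $N_{\mathrm{eff}}$ are all assumptions of this illustration; the conditioning on image data is omitted.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    """Density of a scalar Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def sis_step(particles, weights, z, f, q_var, r_var, rng):
    """One sequential importance sampling step with the prediction density as
    importance function, so the update reduces to w_k = w_{k-1} * p(z_k | x_k)."""
    new_particles = [f(x) + rng.gauss(0.0, math.sqrt(q_var)) for x in particles]
    new_weights = [w * gaussian_pdf(z, x, r_var) for w, x in zip(weights, new_particles)]
    total = sum(new_weights)
    return new_particles, [w / total for w in new_weights]

def effective_sample_size(weights):
    """N_eff = 1 / sum(w_i^2) for normalized weights; N when uniform, 1 when degenerate."""
    return 1.0 / sum(w * w for w in weights)

rng = random.Random(0)
n = 200
particles = [rng.gauss(0.0, 1.0) for _ in range(n)]
weights = [1.0 / n] * n
ess_history = []
truth = 0.0
for k in range(30):
    # Hypothetical scalar model: x_k = 0.9 x_{k-1} + noise, z_k = x_k + noise.
    truth = 0.9 * truth + rng.gauss(0.0, 0.5)
    z = truth + rng.gauss(0.0, 1.0)
    particles, weights = sis_step(particles, weights, z, lambda x: 0.9 * x, 0.25, 1.0, rng)
    ess_history.append(effective_sample_size(weights))
```

Without any resampling, `ess_history` would typically shrink toward a handful of particles, which is the degeneracy discussed above.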

A second solution to the problem of weight variance increase relies on the 
use of resampling methods. Such methods consist in removing trajectories with weak 
normalized weights, and in adding copies of the trajectories associated with strong weights, 
as soon as the number of significant particles becomes too small [9]. Obviously, these two 
solutions may be coupled for better efficiency. Nevertheless, it is important to stress 
that the resampling step introduces errors and is only made necessary by the discrepancy 
between the unknown true pdf and the importance function. As a consequence, the 
resampling step is necessary in practice, but should be used as rarely as possible. It can 
be noticed that setting the importance function to the diffusion process and resampling 
at each iteration amounts to weighting the particles directly with the likelihood. This choice 
has been made in the condensation algorithm [11]. 
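One way to resample "as rarely as possible" is to trigger the step only when the effective sample size falls below a threshold. The sketch below uses systematic resampling, a common scheme that is not necessarily the one used by the authors; the threshold ratio of 0.5 and the toy weights are hypothetical.

```python
import random

def systematic_resample(particles, weights, rng):
    """Systematic resampling: draw N equally spaced points in [0, 1) with one random
    offset and pick the particle whose cumulative-weight interval contains each."""
    n = len(particles)
    offset = rng.random() / n
    resampled = []
    cumulative = weights[0]
    j = 0
    for i in range(n):
        u = offset + i / n
        while u > cumulative and j < n - 1:
            j += 1
            cumulative += weights[j]
        resampled.append(particles[j])
    return resampled, [1.0 / n] * n

def resample_if_needed(particles, weights, rng, threshold_ratio=0.5):
    """Resample only when N_eff = 1 / sum(w^2) drops below threshold_ratio * N."""
    n_eff = 1.0 / sum(w * w for w in weights)
    if n_eff < threshold_ratio * len(particles):
        return systematic_resample(particles, weights, rng)
    return particles, weights

rng = random.Random(1)
particles = ["a", "b", "c", "d"]
degenerate = [0.97, 0.01, 0.01, 0.01]
balanced = [0.25, 0.25, 0.25, 0.25]
after_degenerate, w1 = resample_if_needed(particles, degenerate, rng)
after_balanced, w2 = resample_if_needed(particles, balanced, rng)
```

With the degenerate weights the dominant particle is duplicated and the weights are reset to uniform; with balanced weights the particle set is left untouched, so no resampling error is introduced.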

As mentioned previously, it may be beneficial to know the expression of the optimal 
importance function. As developed in the next section, it is possible to infer this function 
for a specific class of systems. 

3 Gaussian Systems and Optimal Importance Function 

Filtering models for tracking in vision applications are traditionally composed of 
simple dynamics and a highly multimodal and complex likelihood [3]. For such models, an 
evaluation of the optimal importance function is usually not accessible. In this section, we 
present some filtering systems relying on a class of likelihoods (possibly multimodal) 
for which it is possible to sample from the optimal importance function. 



3.1 Gaussian System with Monomodal Likelihood 

We consider first a conditional nonlinear system, composed of a nonlinear state equation 
with an additive Gaussian noise, and a linear Gaussian likelihood: 

$$x_k = f_k^{I_{0:k}}(x_{k-1}) + w_k^{I_{0:k}}, \qquad w_k^{I_{0:k}} \sim \mathcal{N}(w_k^{I_{0:k}}; 0, Q_k^{I_{0:k}}), \qquad (2)$$

$$z_k = H_k^{I_{0:k}} x_k + v_k^{I_{0:k}}, \qquad v_k^{I_{0:k}} \sim \mathcal{N}(v_k^{I_{0:k}}; 0, R_k^{I_{0:k}}). \qquad (3)$$

For these models the analytic expression of the optimal importance function may be 
inferred. As a matter of fact, noticing that: 

$$p(z_k | x_{k-1}, I_{0:k}) = \int p(z_k | x_k, I_{0:k})\, p(x_k | x_{k-1}, I_{0:k})\, dx_k, \qquad (4)$$

we deduce: 

$$z_k | x_{k-1}, I_{0:k} \sim \mathcal{N}(z_k; H_k f_k(x_{k-1}), R_k + H_k Q_k H_k^t), \qquad (5)$$

which yields a simple tractable expression for the weight calculation (1) (for the sake 
of clarity, the index $I_{0:k}$ has been omitted). As for the optimal importance function, we 
have: 

$$p(x_k | x_{k-1}, z_k, I_{0:k}) = p(z_k | x_k, I_{0:k})\, p(x_k | x_{k-1}, I_{0:k}) / p(z_k | x_{k-1}, I_{0:k}) \qquad (6)$$

and thus, 

$$x_k | x_{k-1}, z_k, I_{0:k} \sim \mathcal{N}(x_k; \mu_k, \Sigma_k), \qquad (7)$$




Optimal Importance Sampling for Tracking in Image Sequences 



with 

$$\Sigma_k = (Q_k^{-1} + H_k^t R_k^{-1} H_k)^{-1},$$

$$\mu_k = \Sigma_k \big( Q_k^{-1} f_k(x_{k-1}) + H_k^t R_k^{-1} z_k \big).$$

In that particular case, all the expressions used in the diffusion process (7) and in 
the update step (5) are Gaussian. The filter corresponding to these models is therefore 
particularly simple to implement. The unconditional version of this result is described 
in [9]. 
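In the scalar case the expressions (5) and (7) reduce to a few arithmetic operations, which the following sketch spells out. The dynamics `f`, the values of `h`, `q`, `r` and the sample inputs are hypothetical and chosen only to make the arithmetic easy to follow; the conditioning on the image sequence is omitted.

```python
import math
import random

def optimal_importance_params(x_prev, z, f, h, q, r):
    """Scalar case of the Gaussian system: x_k = f(x_{k-1}) + w, z_k = h * x_k + v,
    with Var(w) = q and Var(v) = r.

    Returns (mu, sigma2) of the optimal importance function N(x_k; mu, sigma2)
    and (pred_mean, pred_var) of p(z_k | x_{k-1}) used in the weight update."""
    sigma2 = 1.0 / (1.0 / q + h * h / r)
    mu = sigma2 * (f(x_prev) / q + h * z / r)
    pred_mean = h * f(x_prev)
    pred_var = r + h * q * h
    return (mu, sigma2), (pred_mean, pred_var)

def gaussian_pdf(x, mean, var):
    """Density of a scalar Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def optimal_step(x_prev, w_prev, z, f, h, q, r, rng):
    """Sample x_k from the optimal importance function and update the weight
    with p(z_k | x_{k-1}), which does not depend on the sampled x_k."""
    (mu, sigma2), (pm, pv) = optimal_importance_params(x_prev, z, f, h, q, r)
    x_new = rng.gauss(mu, math.sqrt(sigma2))
    return x_new, w_prev * gaussian_pdf(z, pm, pv)

def f(x):
    """Hypothetical dynamics used for illustration."""
    return 0.5 * x + 1.0

(mu, sigma2), (pm, pv) = optimal_importance_params(2.0, 3.0, f, h=1.0, q=1.0, r=1.0)
x_new, w_new = optimal_step(2.0, 1.0, 3.0, f, 1.0, 1.0, 1.0, random.Random(0))
```

Because the weight factor $p(z_k | x_{k-1})$ is independent of the newly drawn $x_k$, the weights can be computed before sampling, and the measurement $z_k$ pulls the proposal mean $\mu_k$ toward the observation.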



3.2 Extension to Multimodal Likelihood 



Considering a single measurement can be too restrictive in ambiguous situations 
or with a cluttered background. We describe here an extension of the previous monomodal 
case to devise a multimodal likelihood. 

Let us now consider a vector of M measurements $z_k = \{z_{k,1}, z_{k,2}, \ldots, z_{k,M}\}$. As 
is commonly done in target tracking [6] and computer vision [11], we assume that 
a unique measurement corresponds to a true match and that the others are due to false 
alarms or clutter. Denoting by $\Phi_k$ a random variable which takes its values in 0, ..., M, we 
designate by $p(\Phi_k = m)$ the probability that measurement $z_{k,m}$ corresponds to the true 
measurement at time k; $p(\Phi_k = 0)$ is the probability that none of the measurements 
corresponds to the true one. Denoting $p_{k,m} = p(\Phi_k = m \mid x_k, I_{0:k})$, and assuming that, 
for all m = 1, ..., M, the measurements $z_{k,1:M}$ are independent conditionally on $x_k$, $I_{0:k}$ and 
$\Phi_k = m$, the likelihood can be written as: 

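$$p(z_k \mid x_k, I_{0:k}) = p_{k,0}\, p(z_k \mid x_k, I_{0:k}, \Phi_k = 0) + \sum_{m=1}^{M} \Big\{ p_{k,m}\, p(z_{k,m} \mid x_k, I_{0:k}, \Phi_k = m) \prod_{j \neq m} p(z_{k,j} \mid x_k, I_{0:k}, \Phi_k = m) \Big\}. \qquad (8)$$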



In order to devise a tractable likelihood for which an analytic expression of the 
optimal importance function may be derived, we make the following hypotheses. We assume 
that (i) the set of mode occurrence probabilities $\{p_{k,i},\ i = 1, \ldots, M\}$ is estimated from 
the images at each instant; (ii) the probability of having no true measurement is set 
to zero ($p_{k,0} = 0$). Such a choice differs from classical tracking assumptions [6,11] 
and may be problematic in case of occlusions. Nevertheless, as we will see, this 
potential deficiency is well compensated by an efficient estimation of the measurement 
noise covariances. We also assume that (iii) considering $z_{k,m}$ as being the true target- 
originated observation, it is distributed according to a Gaussian law of mean $H_{k,m} x_k$ 
and covariance $R_{k,m}$. As a last hypothesis (iv), we assume that the false alarms are 
uniformly distributed over a measurement region (also called gate) at time k. The total 
area of the validation gate $V_k$ will be denoted $|V_k|$. 

All these assumptions lead to an observation model which can be written as a linear 
combination of Gaussian laws: 



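$$z_k \mid x_k, I_{0:k} \sim \sum_{m=1}^{M} p_{k,m}\, \mathcal{N}(z_{k,m};\, H_{k,m} x_k,\, R_{k,m}) \prod_{j \neq m} \frac{\mathbb{1}_{V_k}(z_{k,j})}{|V_k|} = \frac{1}{|V_k|^{M-1}} \sum_{m=1}^{M} p_{k,m}\, \mathcal{N}(z_{k,m};\, H_{k,m} x_k,\, R_{k,m}). \qquad (9)$$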



In the same way as for the monomodal measurement equation (§3.1), the optimal 
importance function can be obtained for such a likelihood when it is associated with a 
Gaussian state equation of form (2). Let us recall that in our case the considered diffusion 
process requires evaluating $p(z_k \mid x_{k-1}, I_{0:k})$ and sampling from $p(x_k \mid x_{k-1}, z_k, I_{0:k})$. 
Applying identity (4) with expression (9), the density used for the weight recursion 
reads: 



$$z_k \mid x_{k-1}, I_{0:k} \sim \frac{1}{|V_k|^{M-1}} \sum_{m=1}^{M} p_{k,m}\, \mathcal{N}(z_{k,m};\, H_{k,m} f_k(x_{k-1}),\, H_{k,m} Q_k H_{k,m}^t + R_{k,m}). \qquad (10)$$



The optimal importance function is deduced using identity (6) and expression (10): 



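$$x_k \mid x_{k-1}, z_k, I_{0:k} \sim \frac{\mathcal{N}(x_k;\, f_k(x_{k-1}),\, Q_k)\, \sum_{m=1}^{M} p_{k,m}\, \mathcal{N}(x_k;\, H_{k,m}^{-1} z_{k,m},\, H_{k,m}^{-1} R_{k,m} H_{k,m}^{-t})}{|V_k|^{M-1}\, p(z_k \mid x_{k-1}, I_{0:k})}$$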

Through Gaussian identities this expression reads as a Gaussian mixture of the form: 

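$$x_k \mid x_{k-1}, z_k, I_{0:k} \sim \sum_{m=1}^{M} p_{k,m}\, \frac{S_{k,m}}{|V_k|^{M-1}\, p(z_k \mid x_{k-1}, I_{0:k})}\, \mathcal{N}(x_k;\, \mu_{k,m},\, \Sigma_{k,m}) \qquad (11)$$

with 

$$\Sigma_{k,m} = (Q_k^{-1} + H_{k,m}^t R_{k,m}^{-1} H_{k,m})^{-1},$$
$$\mu_{k,m} = \Sigma_{k,m}\, (Q_k^{-1} f_k(x_{k-1}) + H_{k,m}^t R_{k,m}^{-1} z_{k,m}),$$
$$S_{k,m} = \frac{|\Sigma_{k,m}|^{1/2}}{2\pi\, |R_{k,m}|^{1/2}\, |Q_k|^{1/2}} \exp\Big(-\tfrac{1}{2}\big(\|f_k(x_{k-1})\|^2_{Q_k^{-1}} + \|z_{k,m}\|^2_{R_{k,m}^{-1}} - \|\mu_{k,m}\|^2_{\Sigma_{k,m}^{-1}}\big)\Big).$$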
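
To make the structure of the mixture (11) concrete, here is a minimal scalar sketch, assuming $H_{k,m} = 1$ and scalar variances. The names `mixture_proposal` and `sample_mixture` are hypothetical. The factor $1/|V_k|^{M-1}$ is dropped because it is the same for every particle at time k and cancels when the weights are normalized.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a scalar Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_proposal(f_prev, q, measurements, variances, probs):
    """Components (weight, mean, var) of the mixture importance function.

    f_prev stands for f_k(x_{k-1}); each measurement z_m has variance r_m and
    mode probability p_m. Also returns the unnormalized evidence, i.e. the
    sum in (10) without the 1/|V_k|^(M-1) factor.
    """
    comps = []
    for z, r, p in zip(measurements, variances, probs):
        var = 1.0 / (1.0 / q + 1.0 / r)
        mean = var * (f_prev / q + z / r)
        # Scalar counterpart of p_m * S_m: p_m * N(z_m; f_prev, q + r_m).
        comps.append((p * gaussian_pdf(z, f_prev, q + r), mean, var))
    evidence = sum(w for w, _, _ in comps)
    return [(w / evidence, m, v) for w, m, v in comps], evidence


def sample_mixture(comps, rng):
    """Draw one sample: pick a component by weight, then sample its Gaussian."""
    u = rng.random()
    acc = 0.0
    for w, mean, var in comps:
        acc += w
        if u <= acc:
            return rng.gauss(mean, math.sqrt(var))
    # Round-off guard: fall back to the last component.
    _, mean, var = comps[-1]
    return rng.gauss(mean, math.sqrt(var))
```

The returned evidence is what multiplies the previous particle weight in the weight recursion, so sampling and weighting share the same per-component computation.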



Let us point out that the proposed systems lead to a simple implementation, as the 
involved distributions are all combinations of Gaussian laws. In addition, as described 
in the next subsection, such systems make it possible to define a relevant validation gate 
for the measurements. 

3.3 Validation Gate 

When tracking in a cluttered environment, an important issue is the definition of 
a region delimiting the space where future observations are likely to occur [6]. Such a 
region is called a validation region or gate. Selecting too small a gate may cause 
the target-originated measurement to be missed, whereas selecting too large a gate is 
computationally expensive and increases the probability of selecting false observations. 

In our framework, the validation gate is defined through the use of the probability 
distribution $p(z_k \mid z_{1:k-1}, I_{0:k})$. For linear Gaussian systems, an analytic expression 
of this distribution may be obtained. This leads to an ellipsoidal probability concentration 
region. For nonlinear models, the validation gate can be approximated by a 
rectangular or an ellipsoidal region, whose parameters are usually complex to define. 
Breidt [8] proposes to use Monte Carlo simulations in order to approximate the density 
$p(z_k \mid z_{1:k-1}, I_{0:k})$, but this solution appears to be time consuming. For the systems we 
propose, it is possible to approximate this density efficiently by a Gaussian mixture. The 
corresponding validation gate $V_k$ consists of a union of ellipses. Observing that: 

$$p(z_k \mid z_{1:k-1}, I_{0:k}) = \int p(z_k \mid x_{k-1}, I_{0:k})\, p(x_{k-1} \mid z_{1:k-1}, I_{0:k-1})\, dx_{k-1},$$

and recalling that an approximation of $p(x_{k-1} \mid z_{1:k-1}, I_{0:k-1})$ is given by the weighted 
swarm of particles $(x_{k-1}^{(i)}, w_{k-1}^{(i)})$, the following approximation can be made: 

$$p(z_k \mid z_{1:k-1}, I_{0:k}) \simeq \sum_i w_{k-1}^{(i)}\, p(z_k \mid x_{k-1}^{(i)}, I_{0:k}). \qquad (12)$$



Introducing expression (10) in (12) leads to an expression of $p(z_k \mid z_{1:k-1}, I_{0:k})$ as a 
combination of N × M Gaussian distributions (N is the number of particles). As considering 
N × M ellipses is computationally expensive, we approximate the density by 
a sum of M Gaussian laws. We then finally obtain an approximation of $V_k$ as a union 
of ellipses 

$$V_k = \bigcup_{m=1:M} \big\{ \epsilon_m : (\epsilon_m - \xi_{k,m})^t\, C_{k,m}^{-1}\, (\epsilon_m - \xi_{k,m}) \le \gamma_m \big\},$$

with the first and second moments defined as: 

$$\xi_{k,m} = \sum_i w_{k-1}^{(i)}\, H_{k,m}\, f_k(x_{k-1}^{(i)}), \qquad C_{k,m} = \sum_i w_{k-1}^{(i)}\, (\ldots)$$

The parameter $\gamma_m$ is chosen in practice as the 99th percentile of the probability for $z_{k,m}$ 
to be the true target-originated measurement. 
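
The reduction from N Gaussians to one Gaussian per measurement can be illustrated with standard moment matching. The sketch below is scalar and assumes that all N components for a given m share the same variance (the scalar counterpart of $H_{k,m} Q_k H_{k,m}^t + R_{k,m}$). It is offered as one plausible way to obtain a mean and a spread for each ellipse, not as the exact expression used for $C_{k,m}$. The function names are hypothetical.

```python
def collapse_to_gaussian(weights, means, common_var):
    """Moment-match a mixture sum_i w_i N(mean_i, common_var) with one Gaussian.

    Returns (xi, c): the mixture mean and the mixture variance.
    """
    xi = sum(w * a for w, a in zip(weights, means))
    spread = sum(w * (a - xi) ** 2 for w, a in zip(weights, means))
    return xi, common_var + spread


def in_gate(e, gate_components, gamma):
    """True if e lies in at least one interval (e - xi)^2 / c <= gamma."""
    return any((e - xi) ** 2 / c <= gamma for xi, c in gate_components)
```

With one `(xi, c)` pair per measurement index m, `in_gate` tests membership in the union of the M regions. In two dimensions the squared ratio becomes the Mahalanobis form used in the definition of $V_k$.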

In addition to a simple and optimal sampling process, the possibility of building a relevant 
approximation of a validation gate constitutes another advantage of the Gaussian models 
we propose. In order to demonstrate their significance experimentally, these systems have 
been applied to point tracking. 

4 Application to Point Tracking 

The objective of point tracking consists in reconstructing the 2D point trajectory along the 
image sequence. To that purpose, it is necessary to make some conservation assumptions 
on some information related to the feature point. These hypotheses may concern the point 
motion, or a photometric/geometric invariance in a neighborhood of the point. 

The usual assumption of luminance pattern conservation along a trajectory has led to 
devise two kinds of methods. The first ones are intuitive methods based on correlation [5]. 
The second ones are defined as differential trackers, built on a differential formulation 
of a similarity criterion. In particular, the well-known Shi-Tomasi-Kanade tracker [17] 
belongs to this latter class. 

In this paper, the proposed approach for point tracking is also built on the basis of 
luminance pattern consistency. In this application, each state $x_k$ represents the location 
of the point projection at time k, in image $I_k$. In order to benefit from the advantages 
of the two classes of methods, we propose to combine a dynamic relying on a differential 
method and measurements based on a correlation criterion. The system we focus on 
is therefore composed of measurements and dynamic equations which both depend on 
$I_{0:k}$. The noise covariance considered at each time is also automatically estimated on 
the image sequence. To properly handle such a system, the point tracker is built from 
the filtering framework presented in §2. 



4.1 Likelihood 

At time k, we assume that $x_k$ is observable through a matching process whose goal is to 
provide the points most similar to $x_0$ from images $I_0$ and $I_k$. The result of this process 
is the measurement vector $z_k$. Each observation $z_{k,m}$ corresponds to a correlation peak. 
The number of correlation peaks (or components of $z_k$) is fixed to a given number. 
Several matching criteria can be used to quantify the similarity between two points. The 
consistency assumption of a luminance pattern has simply led to consider the sum-of- 
squared-differences criterion. 

As in [18], the correlation surface, denoted $r_k(x, y)$ and computed over the validation 
gate $V_k$, is converted into a response distribution: $\mathcal{D}_k = \exp(-c\, r_k(x, y))$, where c 
is a normalizing factor, fixed such that $\int_{V_k} \mathcal{D}_k = 1$. This distribution is assumed to 
represent the probability distribution associated with the matching process. The relative 
height of the different peaks defines the probability $p_{k,m}$ of the different measurements 
$z_{k,m}$. The covariance matrices $R_{k,m}$ are estimated from the response distribution on 
local supports centered around each observation. A chi-square "goodness of fit" test is 
performed in order to check whether this distribution is locally better approximated by a 
Gaussian or by a uniform law [3]. An approximation by a Gaussian distribution indicates a 
clear discrimination of the measurement, and $R_{k,m}$ is therefore set to the local covariance of 
the distribution. Conversely, an approximation by a uniform distribution indicates 
an unclear peak detection on the response distribution. This may be due to an absence 
of correlation in the presence of occlusions or noisy situations. In this case, the diagonal 
terms of $R_{k,m}$ are fixed to infinity, and the off-diagonal terms are set to 0. Finally, in 
this application, matrices $H_{k,m}$ are set to identity. 

Fig. 1. Benefit of the optimal importance function in case of occlusion. Tracking results obtained 
with (a) the Conditional Nonlinear Filter, with the use of the optimal importance function 
(initial conditions at frame #0, then frames #13, 18, 30), and (b) a CONDENSATION-like algorithm, 
without the use of the optimal importance function (frames #13, 18, 30). For both 
algorithms, the considered filtering system is the one described in §3.2. Black crosses present 
the estimates. White and gray crosses correspond to the observations, and the ellipses to their 
associated validation gates. The white crosses show the measurement of highest probability. 
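
The conversion of a correlation surface into a response distribution (§4.1) can be sketched as follows. This is a minimal illustration under stated assumptions: the SSD surface is a small list of rows, the constant `c` is supplied by the caller, and normalization is done by dividing by the total mass, which is a simplification of choosing c so that the distribution integrates to one. The goodness-of-fit test is not included, and all function names are hypothetical.

```python
import math


def response_distribution(ssd, c):
    """Convert an SSD surface (list of rows) into weights exp(-c * ssd) summing to 1."""
    raw = [[math.exp(-c * v) for v in row] for row in ssd]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]


def peak_probabilities(dist, peaks):
    """Relative heights of the peaks, given as (x, y) positions, normalized to sum to 1."""
    heights = [dist[y][x] for x, y in peaks]
    total = sum(heights)
    return [h / total for h in heights]


def local_covariance(dist, cx, cy, radius):
    """Weighted 2x2 covariance of pixel coordinates in a window centred on (cx, cy)."""
    pts = [(x, y, dist[y][x])
           for y in range(max(0, cy - radius), min(len(dist), cy + radius + 1))
           for x in range(max(0, cx - radius), min(len(dist[0]), cx + radius + 1))]
    mass = sum(w for _, _, w in pts)
    mx = sum(x * w for x, _, w in pts) / mass
    my = sum(y * w for _, y, w in pts) / mass
    sxx = sum(w * (x - mx) ** 2 for x, _, w in pts) / mass
    syy = sum(w * (y - my) ** 2 for _, y, w in pts) / mass
    sxy = sum(w * (x - mx) * (y - my) for x, y, w in pts) / mass
    return [[sxx, sxy], [sxy, syy]]
```

A sharp, isolated peak yields a small local covariance, while a flat response yields a large one, which matches the role $R_{k,m}$ plays in weighting each measurement.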



4.2 Dynamic Equation 

As we wish to manage situations where no a priori knowledge on the dynamics of the 
surrounding object is available, and in order to be reactive to any unpredictable change of 
speed and direction of the feature point, the dynamic we consider is estimated from $I_{0:k}$. 
The state equation describes the motion of a point $x_{k-1}$ between images k-1 and k, and 
allows a prediction of $x_k$. A robust parametric motion estimation technique [14] is used 
to estimate reliably a 2D parametric model representing the dominant apparent velocity 
field on a given support $\mathcal{R}$. The use of such a method on an appropriate local support 
around $x_{k-1}$ provides an estimate of the motion vector at the point $x_{k-1}$ from images 
$I_{k-1}$ and $I_k$. As $\mathcal{R}$ is a local domain centered at $x_{k-1}$, the estimated parameter vector 
depends in a nonlinear way on $x_{k-1}$. The noise variable $w_k$ accounts for errors related 
to the local motion model. It is assumed to follow a zero-mean Gaussian distribution of 
fixed covariance. 

Fig. 2. Corridor sequence: (a) initial points, (b) CNLF, (c) STK. 

5 Experimental Results 

In this section, we present some experimental results on four different sequences to 
demonstrate the efficiency of the proposed point tracker. 

The first result is presented to demonstrate the interest of the optimal importance 
function. To that purpose, we have chosen to study an occlusion case, on the Garden 
sequence. This sequence shows a garden and a house occluded by a tree. Let us focus 
on a particular feature point located on the top of the house roof. This point is visible in the 
first two images and stays hidden from frame #3 to frame #15. Two algorithms have 
been tested for the tracking of this point. Both of them rely on the same filtering system 
(the one described in §3.2). The first one is the method we propose (namely the 
Conditional NonLinear Filter (CNLF), with the use of the optimal importance function), 
whereas the second one is a CONDENSATION-like algorithm, for which the considered 
importance function is identified with the diffusion process. Figure 1 presents the obtained 
results. The use of the optimal importance function allows us to recover the actual point 
location after a long occlusion. This shows clearly the benefit that can be obtained when 
taking into account the measurement in the diffusion process. 

The second sequence, Corridor, constitutes a very difficult situation, since it com- 
bines large geometric deformations, high contrast, and ambiguities. The initial points 
and the final tracking results provided by the Shi-Tomasi-Kanade (STK) tracker, and 
the CNLF are presented in figure 2. In such a sequence, it can be noticed that the STK 
leads to good tracking results only for a small number of points. In contrast, for the 
CNLF, the trajectories of all the feature points are well recovered. Let us point out that 
for this sequence, considering one or several observations per point leads to nearly the 
same results. Another result of the CNLF, with a multimodal likelihood, is presented on 
the sequence Caltra. This sequence shows the motion of two balls, fixed on a rotating 
rigid circle, on a cluttered background. Compared to STK (fig. 3), the CNLF succeeds in 
discriminating the balls from the wallpaper, and provides the exact trajectories. Such 
a result shows the ability of this tracker to deal with complex trajectories in a cluttered 
environment. 

The last result, on the Hand sequence, demonstrates that considering several observations 
improves the tracking results in ambiguous situations. This sequence 
presents finger motions of one hand. Figure 4 illustrates the results obtained with the 
CNLF, considering a monomodal likelihood (a) and a multimodal likelihood (b). As 
can be observed, considering only one correlation peak per point leads here to confusing 
the different fingers. These confusions are resolved by taking into account several 
(here, 3) observations. 

Fig. 3. Caltra sequence: initial points (#0); Conditional Nonlinear Filter (#7, 20, 30, 39); 
CNLF trajectories; Shi-Tomasi-Kanade tracker (#7, 20, 30). 

Fig. 4. Hand sequence: Conditional NonLinear Filter results at frames #0, 8, 16, 23, 31. 
(a) Monomodal likelihood (M=1): only one observation is considered per point and the 
involved system is the one described in §3.1; (b) multimodal likelihood (M=3): 3 observations 
are considered per point and the involved system is the one described in §3.2. 

6 Conclusion 

In this paper, we proposed a Conditional NonLinear Filter for point tracking in image 
sequences. This tracker has the particularity of dealing with a priori-free systems, which 
entirely depend on the image data. In that framework, a new filtering system has been 
described. To be robust to cluttered background, we have proposed a particular class of 
multimodal likelihood. Unlike usual systems used in vision applications within the nonlinear 
stochastic filtering framework, we deal with a system which allows an exact estimate 
of the optimal importance function. The knowledge of the optimal function makes it possible 
to include measurements naturally into the diffusion process and to build a relevant 
approximation of a validation gate. Such a framework, applied to point tracking, 
significantly improves the results of traditional trackers. The resulting 
point tracker has been shown to be robust to occlusions and complex trajectories. 

Abstract. We describe a new approach for learning to perform class-based 
segmentation using only unsegmented training examples. As in 
previous methods, we first use training images to extract fragments that 
contain common object parts. We then show how these parts can be 
segmented into their figure and ground regions in an automatic learning 
process. This is in contrast with previous approaches, which required 
complete manual segmentation of the objects in the training examples. 
The figure-ground learning combines top-down and bottom-up processes 
and proceeds in two stages, an initial approximation followed by iterative 
refinement. The initial approximation produces figure-ground labeling of 
individual image fragments using the unsegmented training images. It 
is based on the fact that on average, points inside the object are cov- 
ered by more fragments than points outside it. The initial labeling is 
then improved by an iterative refinement process, which converges in 
up to three steps. At each step, the figure-ground labeling of individual 
fragments produces a segmentation of complete objects in the training 
images, which in turn induces a refined figure-ground labeling of the individual 
fragments. In this manner, we obtain a scheme that starts from 
unsegmented training images, learns the figure-ground labeling of image 
fragments, and then uses this labeling to segment novel images. Our ex- 
periments demonstrate that the learned segmentation achieves the same 
level of accuracy as methods using manual segmentation of training im- 
ages, producing an automatic and robust top-down segmentation. 



1 Introduction 

The goal of figure-ground segmentation is to identify an object in the image and 
separate it from the background. One approach to segmentation - the bottom-up 
approach - is to first segment the image into regions and then identify the image 
regions that correspond to a single object. The initial segmentation mainly relies 
on image-based criteria, such as the grey level or texture uniformity of image 
regions, as well as the smoothness and continuity of bounding contours. One 
of the major shortcomings of the bottom-up approach is that an object may be 
segmented into multiple regions, some of which may incorrectly merge the object 
with its background. These shortcomings, as well as evidence from human vision 
[1,2], suggest that different classes of objects require different rules and criteria 
to achieve meaningful image segmentation. A complementary approach, called 
top-down segmentation, is therefore to use prior knowledge about the object at 
hand such as its possible shape, color, texture and so on. The relative merits of 
bottom-up and top-down approaches are illustrated in Fig. 1. 

A number of recent approaches have used fragments (or patches) to perform 
object detection and recognition [3, 4, 5, 6]. Another recent work [7] has extended 
this fragment approach to segment and delineate the boundaries of objects from 
cluttered backgrounds. The overall scheme of this segmentation approach, in- 
cluding the novel learning component developed in this paper, is illustrated 
schematically in Fig. 2. The first stage in this scheme is fragment extraction 
(F.E.), which uses unsegmented class and non-class training images to extract 
and store image fragments. These fragments represent local structure of common 
object parts (such as a nose, leg, neck region etc. for the class of horses) and are 
used as shape primitives. This stage applies previously developed methods for 
extracting such fragments, including [8,4,5]. In the detection and segmentation 
stage a novel class image is covered by a subset of the stored fragments. A critical 
assumption is that the figure-ground segmentation of these covering fragments 
is already known, and consequently they induce figure-ground segmentation of 
the object. In the past, this figure-ground segmentation of the basic fragments, 
termed the fragment labeling stage (F.L.), was obtained manually. The focus 
of this paper is to extend this top-down approach by providing the capacity to 
learn the segmentation scheme from unsegmented training images, and avoiding 
the requirement for manual segmentation of the fragments. 

The underlying principle of our learning process is that class images are 
classified according to their figure rather than background parts. While figure 
regions in a collection of class-image samples share common sub-parts, the back- 
ground regions are generally arbitrary and highly variable. Fragments are there- 
fore more likely to be detected on the figure region of a class image rather than 
in the background. We use these fragments to estimate the variability of regions 
within sampled class images. This estimation is in turn applied to segment the 
fragments themselves into their figure and background parts. 

1.1 Related Work 

As mentioned, segmentation methods can be divided into bottom-up and top- 
down schemes. Bottom-up segmentation approaches use different image-based 
uniformity criteria and search algorithms to find homogeneous segments within 
the image. The approaches vary in the selected image-based similarity criteria, 
such as color uniformity, smoothness of bounding contours, texture etc. as well 
as in their implementation. 

Top-down approaches that use class-based (or object-specific) criteria to 
achieve figure-ground segmentation include deformable templates [10], active 
shape models (ASM) [11] and active contours (snakes) [12]. In the work on deformable 
templates, the template is designed manually for each class of objects. 
In schemes using active shapes, the training data are manually segmented to 
produce aligned training contours. The object or class-specific information in 
the active contours approach is usually expressed in the initial contour and in 
the definition of the external force. In all of the above top-down segmentation 
schemes, the class learning stage requires extensive manual intervention. 

Fig. 1. Bottom-up and Top-down segmentation (two examples): Left - input images. 
Middle - state-of-the-art bottom-up segmentation ([9]). Each colored region (middle-left) 
represents a segment and the edge map (middle-right) represents the segments' 
boundaries. Right - class-specific segmentation (white contour) as learned automatically 
by our system. The bottom-up approach may segment objects into multiple 
parts and merge background and object parts as it follows prominent image-based 
boundaries. The top-down approach uses stored class-specific representation to give an 
approximation for the object boundaries. This approximation can then be combined 
with bottom-up segmentation to provide an accurate and complete segmentation of 
the object. 

Fig. 2. The approach starts from a set of class (C) and non-class (NC) training images. 
The first stage is fragment extraction (F.E.) that extracts a set of informative fragments. 
This is followed by fragment-labeling (F.L.), the focus of this work, in which 
each fragment is divided into figure and background. During recognition, fragments 
are detected in input images (fragment detection, F.D.). The fragments' labeling and 
detection are then combined to segment the input images. 

In this work we describe a scheme that automatically segments shape frag- 
ments into their figure and ground relations using unsegmented training images, 
and then uses this information to segment class objects in novel images from 
their background. The system is given a set of class images and non-class images 
and requires only one additional bit for each image in this set (“class” 
/ “non-class”). 

2 Constructing a Fragment Set (Fragment Extraction) 

The first step in the fragment-based approach is the construction of a set of 
fragments that represents the class and can be used to effectively recognize and 
segment class images. We give below a brief overview of the fragment extraction 
process (further details of this and similar approaches can be found in [8,4,5]). 
The construction process starts by randomly collecting a large set of candidate 
fragments of different sizes extracted from images of a general class, such as 
faces, cars, etc. The second step is to select from the initial pool of fragments a 
smaller subset of the more useful fragments for detection and classification. These 
fragments are selected using an information measure criterion. The aim is that 
the resulting set be highly informative, so that a reliable classification decision 
can be made based on the detection of these fragments. Detected fragments 
should also be highly overlapping as well as being well-distributed across the 
object, so that together they are likely to cover it completely. The approach in [8] 
sets for each candidate fragment a detection threshold selected to maximize the 
mutual information between the fragment detection and the class. A fragment 
is subsequently detected in an image region if the similarity measure (absolute 
value of the normalized linear correlation in our case) between the fragment 
and that region exceeds the threshold. Candidates $f_j$ are added to the fragment 
set $F_s$ one by one so as to maximize the gain in mutual information $I(F_s; C)$ 
between the fragment set and the class: 

$$f_j = \arg\max_{f}\, \big( I(F_s \cup f;\, C) - I(F_s;\, C) \big) \qquad (1)$$

This selection process produces a set of fragments that are more likely to be 
detected in class compared with non-class images. In addition, the selected frag- 
ments are highly overlapping and well distributed. These properties are obtained 
by the selection method and the fragment set size: a fragment is unlikely to be 
added to the set if the set already contains a similar fragment since the mutual 
information gained by this fragment would be small. The set size is determined 
in such a way that the class representation is over-complete and, on average, 
each detected fragment overlaps with several other detected fragments (at least 
3 in our implementation). 
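
The greedy selection rule (1) can be sketched on toy data. In this minimal example each candidate fragment is represented by a binary detection vector over the training images, and the mutual information between a fragment set and the class is estimated from the empirical joint distribution of detection patterns and class labels. This estimator, the data layout, and the function names are assumptions made for illustration; they are not taken from [8].

```python
import math
from collections import Counter


def mutual_information(columns, labels):
    """Empirical I(F; C) in bits, F being the joint detection pattern of the columns."""
    n = len(labels)
    patterns = list(zip(*columns)) if columns else [()] * n
    joint = Counter(zip(patterns, labels))
    p_f = Counter(patterns)
    p_c = Counter(labels)
    return sum((cnt / n) * math.log2(cnt * n / (p_f[f] * p_c[c]))
               for (f, c), cnt in joint.items())


def greedy_select(detections, labels, size):
    """Add fragments one by one, each time maximizing the gain in I(F_s; C)."""
    selected = []
    remaining = dict(detections)
    while remaining and len(selected) < size:
        current = [detections[name] for name in selected]
        base = mutual_information(current, labels)
        best = max(remaining, key=lambda name: mutual_information(
            current + [remaining[name]], labels) - base)
        selected.append(best)
        del remaining[best]
    return selected
```

A fragment that duplicates one already in the set brings no gain and is passed over, which is the behavior described for similar fragments. The joint-pattern estimator is only practical for small sets, since the number of patterns grows exponentially with the set size.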

3 Learning the Fragments Figure-Ground Segmentation 

To use the image fragments for segmentation, we next need to learn the figure- 
ground segmentation of each fragment. The learning process relies on two main 
criteria: border consistency and the degree of cover, which is related to the variability 
of the background. We initialize the process by performing a stage of 
bottom-up segmentation that divides the fragment into a collection of uniform 
regions. The goal of this segmentation is to give a good starting point for the 
learning process - pixels belonging to a uniform subregion are likely to have the 
same figure-ground labeling. This starting point is improved later (Sect. 5). A 
number of bottom-up segmentation algorithms were developed in the past to 
identify such regions. In our implementation we use the algorithm developed by 
[9], which is fast (less than one second for an image with 240x180 pixels) and 
segments images on several scales. We used scales in which the fragments are 
over-segmented (on average they divide the fragments into 9 subregions), providing 
subregions that are likely to be highly uniform. The algorithm was found 
to be insensitive to this choice of scale (scales that give on average 4-16 subregions 
produce almost identical results). We denote the different regions of a 
fragment F by $R_1, R_2, \ldots, R_n$. Each region in the fragment ($R_j$) defines a subset 
of fragment points that are likely to have the same figure-ground label. 

3.1 Degree of Cover 

The main stage of the learning process is to determine for each region whether 
it is part of the figure or background. In our fragment-based scheme, a region 
$R_j$ that belongs to the figure will be covered on average by significantly more 
fragments than a background region $R_i$, for two reasons. First, the set of extracted 
fragments is sufficiently large to cover the object several times (7.2 on 
average in our scheme). Second, the fragment selection process extracts regions 
that are common to multiple training examples and consequently most of the 
fragments come from the figure rather than from background regions. Therefore, 
the number of fragments detected in the image that cover a fragment's region 
$R_j$ can serve to indicate whether $R_j$ belongs to the figure (high degree of cover) 
or background (low degree of cover). The average degree of cover of each region 
over multiple images (denoted by $r_j$) can therefore be used to determine its 
figure-ground label. The value $r_j$ is calculated by counting the average number 
of fragments overlapping with the region over all the class images in the training 
set. The higher $r_j$, the higher its likelihood to be a figure region (in our scheme, 
an average of 7.0 for figure points compared with 2.2 for background points). 
The degree of cover therefore provides a powerful tool to determine the figure-ground 
segmentation of the fragments. Using the degree of cover $r_j$, $j = 1, \ldots, n$, 
for the n regions in the fragment, we select as the figure part all the regions with 
$r_j \ge \bar{r}$ for some selected threshold $\bar{r}$. That is, the figure part is defined by: 

$$P(\bar{r}) = \bigcup_{\{j \,:\, r_j \ge \bar{r}\}} R_j \qquad (2)$$



In this manner, all the regions contained in a chosen figure part $P(\bar{r})$ have 
a degree of cover higher than or equal to $\bar{r}$, while all other regions have a degree of 
cover lower than $\bar{r}$. The segmentation of the fragment into figure and background 
parts is therefore determined by a single parameter, the degree of cover $\bar{r}$. Since 
$\bar{r} = r_k$ for some $k = 1, \ldots, n$, the number of possible segmentations is now reduced 
from $2^n$ to n. This stage, of dividing the fragment into uniform regions and then 
ranking them using the degree of cover, is illustrated in Fig. 3. We next show 
how to choose from these options a partition that is also consistent with edges 
found in image patches covered by the fragment. 

Fig. 3. Degree of cover: a fragment segmented into uniform regions (a) is detected on 
a given object (b). The degree of cover by overlapping fragments (also detected on the 
object) indicates the likelihood of a region to be a figure sub-region, indicated in (d) 
by the brightness of the region. 
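
The reduction from $2^n$ labelings to n candidates in §3.1 can be made concrete with a short sketch. It assumes the average degrees of cover are already available as a mapping from region name to $r_j$. The function name and the data layout are hypothetical.

```python
def candidate_figure_parts(cover):
    """List the candidate figure parts P(r) for r ranging over the observed covers.

    cover maps a region name to its average degree of cover r_j. Each entry of
    the result is (threshold, regions with r_j >= threshold), ordered from the
    most selective threshold to the least selective one.
    """
    parts = []
    for threshold in sorted(set(cover.values()), reverse=True):
        figure = sorted(name for name, r in cover.items() if r >= threshold)
        parts.append((threshold, figure))
    return parts
```

For three regions with covers 7.0, 5.1 and 2.2, this yields three nested candidates: the best-covered region alone, the two best-covered regions, and the whole fragment. When several regions share the same cover value they enter together, so the number of candidates is at most n.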



3.2 Border Consistency 

The degree of cover indicates the likelihood of a fragment region to belong to 
the figure part. We next determine the boundary that optimally separates figure 
from background regions (such a boundary will exist in the fragment, unless 
it is an internal fragment). A fragment often contains multiple edges, and it is 
not evident which of these corresponds to the figure-ground boundary we are 
looking for. Using the training image set, we detect the fragment in different 
class-images. We collect the image patches where the fragment was detected, 
and denote this collection by Hi, H 2 , . . . , Hj~. Each patch in this collection, Hj , 
is called a fragment hit and Hj(x,y) denotes the grey level value of pixel (x,y) 
in this hit. In each one of these hits we apply an edge detector. Some edges, the 
class- specific edges, will be consistently present among hits, while other edges 
are arbitrary and change from one hit to the other. We learn the fragment’s 
consistent edges by averaging the edges detected in these hits. Pixels residing on 
consistent edges will get a high average value, whereas pixels residing on noise 
or background edges will get a lower average, defined by:

D(x,y) = \frac{1}{k} \sum_{j=1}^{k} \mathrm{edge}(H_j(x,y))    (3)

where edge(·) is the output of an edge detector acting on a given image. By the
end of this process D(x,y) is used to define the consistent edges of the fragment
(see also Fig. 4).
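
The averaging in (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not say which edge detector was used, so a plain gradient magnitude stands in for edge(·), and the names edge_map and consistent_edges as well as the toy data are hypothetical.

```python
import numpy as np

def edge_map(patch):
    """Gradient-magnitude edge strength (an assumed stand-in for edge(.))."""
    gy, gx = np.gradient(patch.astype(float))
    return np.hypot(gx, gy)

def consistent_edges(hits):
    """Average the edge maps of k equally sized fragment hits, as in Eq. (3)."""
    hits = np.asarray(hits, dtype=float)          # shape (k, height, width)
    return np.mean([edge_map(h) for h in hits], axis=0)

# Toy data: a vertical step edge present in every hit, plus random clutter.
rng = np.random.default_rng(0)
base = np.zeros((8, 8))
base[:, 4:] = 1.0
hits = [base + 0.2 * rng.standard_normal(base.shape) for _ in range(50)]
D = consistent_edges(hits)
```

In this toy setting the columns next to the step keep a high average, while the clutter, which differs from hit to hit, averages to a lower value.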

We differentiate between three types of edges seen in this collection of hits. 
The first, defined here as the border edge, is an edge that separates the figure
part of the fragment from its background part. This is the edge we are looking
for. The second, defined here as an interior edge, is an edge within the figure part
of the object. For instance, a human eye fragment may contain interior edges
at the pupil or eyebrow boundaries. The last type, noise edge, is arbitrary and
can appear anywhere in the fragment hit. It usually results from background 
texture or from artifacts coming from the edge detector. The first two types of 
edges are the consistent edges and in the next section we show how to use them 
to segment the fragment.

Fig. 4. Learning consistent edges. Fragment (top left) and the consistent boundary located
in it (bottom right). To detect the consistent boundary, fragment hits (H_1, ..., H_k)
are extracted from a large collection of training class images where the fragment is
detected (top row shows the hit location in the images, middle row shows the hits
themselves). An edge detector is used to detect the edge map of these hits (bottom
row). The average of these edge maps gives the consistent edge (bottom right).



3.3 Determining the Figure-Ground Segmentation 

In this section we combine the information supplied by the consistent edges 
computed in the last step with the degree of cover indicating the likelihood of 
fragment regions to be labeled as figure. The goal is to divide each fragment F
into a figure part P and a complementary background part P^c in an optimal
manner. The boundary between P and P^c will be denoted by ∂P. As mentioned,
the set of consistent edges includes both the figure-ground boundary in the
fragment (if such exists), as well as consistent internal boundaries within the
object. Therefore, all the consistent edges should be either contained in the figure
regions, or should lie along the boundary ∂P separating P from the background
part P^c. A good segmentation will therefore maximize the following functional:

P = \arg\max_{P(r)} \left( \sum_{(x,y) \in P(r)} D(x,y) + \lambda \sum_{(x,y) \in \partial P(r)} D(x,y) \right)    (4)

The first term in this functional is maximized when the fragment’s figure part
contains as many as possible of the consistent edges. The second term is maximized
when the boundary ∂P separating figure from ground in the fragment is
supported by consistent edges. The parameter λ (λ = 10 in our implementation)
controls the relative weights of the two terms.

Fig. 5. Stages of fragment figure-ground segmentation (two examples, rows (1) and (2)).
Given a fragment (a), we divide it into regions likely to have the same figure-ground
label. We then use the degree of cover to rank the likelihood of each region to be in
the figure part of the fragment (b). Next, the fragment hits are used to determine its
consistent edges (c). In the last stage, the degree of cover and the consistent edges
are used to determine the figure-ground segmentation of the fragment (d).

Solving this problem is straightforward. As noted in (2), there are n possible
values for r, and each defines a possible segmentation of the fragment into a figure
part P(r) and background P^c(r). It is therefore necessary to check which of the n
options maximizes (4). This procedure alone produces good segmentation results,
as discussed in the results section. The overall process is illustrated in Fig. 5. The
figure depicts the stages of labeling two fragments that are difficult to segment.
Note that by using the degree of cover and border consistency criteria it becomes
possible to solve problems that are difficult to address using bottom-up criteria
alone. Some parts of the contours (Fig. 5(1)) separating the figure from the
background are missing in the fragment but are reconstructed by the consistent
edges. Similarly, using the border consistency and degree of cover criteria, it is
possible to group together dissimilar regions (e.g. the black and white regions of
the horse head in Fig. 5(2)).
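
The exhaustive check over the n candidate partitions can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that the fragment is given as an integer region map, that cover[i] is the degree of cover of region i, that candidate k takes the k regions with the highest cover as figure, and that the boundary consists of the figure pixels with a 4-neighbour in the background. The names boundary and best_partition are hypothetical.

```python
import numpy as np

def boundary(mask):
    """Figure pixels with at least one 4-neighbour outside the figure."""
    # Pad with True so that the fragment frame itself never counts as a boundary.
    padded = np.pad(mask, 1, constant_values=True)
    inner = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
             padded[1:-1, :-2] & padded[1:-1, 2:])
    return mask & ~inner

def best_partition(regions, cover, D, lam=10.0):
    """Score each of the n candidate figure parts with functional (4)."""
    order = np.argsort(-np.asarray(cover))       # region ids by decreasing cover
    best_score, best_mask = -np.inf, None
    for k in range(1, len(order) + 1):
        mask = np.isin(regions, order[:k])       # figure = k most covered regions
        score = D[mask].sum() + lam * D[boundary(mask)].sum()
        if score > best_score:
            best_score, best_mask = score, mask
    return best_mask, best_score

# Toy fragment: three vertical strips, a weak interior edge and a strong border edge.
regions = np.tile([0, 0, 1, 1, 2, 2], (4, 1))
cover = [0.9, 0.7, 0.1]
D = np.zeros((4, 6))
D[:, 1] = 0.5      # consistent interior edge
D[:, 3] = 1.0      # consistent border edge
mask, score = best_partition(regions, cover, D)
```

In the toy fragment the two most covered strips are selected as figure, because only that partition places its boundary on the strong consistent edge.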



4 Image Segmentation by Covering Fragments 



Once the figure-ground labels of the fragments are assigned, we can use them
to segment new class images in the following manner. The detected fragments 
in a given image serve to classify covered pixels as belonging to either figure or 
background. Each detected fragment applies its figure-ground label to “vote” 
for the classification of all the pixels it covers. For each pixel we count the 
number of votes classifying it as figure versus the number of votes classifying it 
as background. In our implementation, the vote of each fragment had a weight 
w(i). This value was set to the class-specificity of the fragment, namely the ratio
between its detection rate and false alarm rate. The classification decision for
the pixel was based on the voting result:

S(x,y) = \begin{cases} +1 & \text{if } \sum_i w(i) L_i(x,y) > 0 \\ -1 & \text{if } \sum_i w(i) L_i(x,y) < 0 \end{cases}    (5)

where \sum_i w(i) L_i(x,y) is the total vote received by pixel (x,y), and
L_i(x,y) = +1 when the figure-ground label of detected fragment F_i votes for
pixel (x,y) to be figure, L_i(x,y) = -1 when it votes for the pixel to be background.
S(x,y) denotes the figure-ground segmentation of the image: figure pixels
are characterized by S(x,y) = +1 and background pixels by S(x,y) = -1.

The segmentation obtained in this manner can be improved using an additional
stage, which removes fragments that are inconsistent with the overall cover using 
the following procedure. We check the consistency between the figure-ground la- 
bel of each fragment Li and the classification of the corresponding pixels it covers, 
given by S(x,y), using normalized linear correlation. Fragments with low cor- 
relation (we used 0.65 as threshold) are regarded as inconsistent and removed 
from the cover. In the new cover, the figure-ground labels of covering fragments 
will consistently classify overlapping regions. The voting procedure (5) is applied 
again, this time only with the consistent fragments, to determine the final figure- 
ground segmentation of the image. The construction of a consistent cover can 
thus be summarized in two stages. In the first stage, all detected fragments are 
used to vote for the figure or ground labeling of pixels they cover. In the second 
stage, inconsistent fragments that “vote” against the majority are removed from 
the cover and the final segmentation of the image is determined. 
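
The two stages can be sketched as below. This is an illustration under assumptions rather than the original code: each detected fragment is represented by a hypothetical tuple (top, left, label, weight), and "normalized linear correlation" is interpreted here as the uncentered normalized correlation between the fragment label and the covered part of S, which is one plausible reading and not confirmed by the text.

```python
import numpy as np

def vote(shape, fragments):
    """Weighted vote of Eq. (5); label is a +1/-1 array placed at (top, left)."""
    total = np.zeros(shape)
    for top, left, label, weight in fragments:
        h, w = label.shape
        total[top:top + h, left:left + w] += weight * label
    return np.sign(total)              # 0 marks uncovered or tied pixels

def correlation(a, b):
    """Uncentered normalized correlation (an assumed interpretation)."""
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def consistent_cover(shape, fragments, threshold=0.65):
    """Stage 1: vote with all fragments. Stage 2: drop inconsistent ones, vote again."""
    S = vote(shape, fragments)
    kept = []
    for frag in fragments:
        top, left, label, _ = frag
        h, w = label.shape
        if correlation(label, S[top:top + h, left:left + w]) >= threshold:
            kept.append(frag)
    return vote(shape, kept), kept

# Toy cover of a 4x8 image whose figure occupies columns 2..5.
ones = np.ones((4, 4))
half = np.ones((4, 4))
half[:, :2] = -1                       # background on the left, figure on the right
fragments = [(0, 0, half, 2.0), (0, 2, ones, 2.0),
             (0, 4, half[:, ::-1], 2.0), (0, 2, -ones, 1.0)]
S, kept = consistent_cover((4, 8), fragments)
```

The fourth toy fragment labels the figure area as background, correlates negatively with the first vote, and is removed before the final vote.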

5 Improving the Figure-Ground Labeling of Fragments 

The figure-ground labeling of individual fragments as described in Sect. 3 can be 
iteratively refined using the consistency of labeling between fragments. Once the 
labeled fragments produce consistent covers that segment complete objects in 
the training images, a region’s degree of cover can be estimated more accurately. 
This is done using the average number of times its pixels cover figure parts in 
the segmented training images, rather than the average number of times its pix- 
els overlap with other detected fragments. The refined degree of cover is then 
used to update the fragment’s figure-ground labeling as described in Sect. 3.3, 
which is then used again to segment complete objects in the training images. 
(As the degree of cover becomes more accurate, we can also use individual pixels
instead of bottom-up subregions to define the fragment labeling.) This iterative
refinement improves the consistency between the figure-ground labeling of overlapping
fragments since the degree of cover is determined by the segmentation
of complete objects and the segmentation of complete objects is determined by
the majority labeling of overlapping fragments. This iterative process was found
to improve the labeling and to converge to a stable state (within 3 iterations), since
the majority of fragment regions are already labeled correctly by the first stage (see results).

6 Results 

We tested the algorithm using three types of object classes: horse heads, hu- 
man faces and cars. The images were highly variable and difficult to segment, 
as indicated by the bottom-up segmentation results (see below). For the class
of horse heads we ran three independent experiments. In each experiment, we 
constructed a fragment set as described in Sect. 2. The fragments were extracted 
from 15 images chosen randomly from a training set of 139 class images (size 
32x36). The selected fragments all contained both figure and background pixels. 
The selection process may also produce fragments that are entirely interior to
the object, in which case the degree of cover will be high for all the figure regions.
We tried two different sizes for the fragment set: in one, we used 100 fragments, 
which on average gave a cover area that is 7.2 times larger than the average area 
of an object; in the second, we used the 40 most informative fragments within 
each larger set of 100 fragments. These smaller sets gave a cover area that was 
3.4 times the average area of an object. We initialized the figure-ground labels 
of the fragments using the method described in Sect. 3. We used the fragments 
to segment all these 139 images, as described in Sect. 4, and then used these 
segmentations to refine the figure-ground labels of the fragments, as described
in Sect. 5. We repeated this refinement procedure until convergence, namely, 
when the updating of figure-ground labels stabilized. This was obtained rapidly, 
after only three iterations. 

To evaluate the automatic figure-ground labeling in these experiments, we 
manually segmented 100 horse head images out of the 139 images, and used 
them as a labeling benchmark. The benchmark was used to evaluate the quality 
of the fragments’ labeling as well as the relative contribution of the different 
stages in the learning process. We performed two types of tests: in the first 
(labeling consistency), we compared the automatic labeling with manual figure- 
ground labeling of individual fragments. For this comparison we evaluated the 
fraction of fragments’ pixels labeled consistently by the learning process and by 
the manual labeling (derived from the manual benchmark). 

In the second type of test (segmentation consistency), we compared the segmentation
of complete objects as derived by the automatically labeled fragments,
the manually labeled fragments, and a bottom-up segmentation. For this comparison
we used the fraction of covered pixels whose labeling matched that given
by the benchmark. In the case of bottom-up segmentation, segments were labeled
such that their consistency with the benchmark is maximal. The output of the
segmentation (given by using [9]) was chosen so that each image was segmented
into a maximum of 4 regions. The average benchmark consistency rate was 92%
for the case of automatically labeled fragments, 92.5% for the case of manually
labeled fragments and 70% for the labeled bottom-up segments. More detailed
results from these experiments are summarized in Table 1. The results indicate
that the scheme is reliable and does not depend on the initial choice of fragment
set. We also found that the smaller fragment sets (the 40 most informative
fragments within each bigger set) give somewhat better results. This indicates that the
segmentation is improved by using the most informative fragments. The automatic
labeling of the fragments is highly consistent with manual labeling, and its
use gives segmentation results with the same level of accuracy as those obtained
using fragments that are labeled manually. The results are significantly better
than those of bottom-up segmentation algorithms.

Table 1. Results. This table summarizes the results of the first two types of tests we
performed (labeling and segmentation consistency).

                                                 Large set            Small set
                                             Ex.1  Ex.2  Ex.3     Ex.1  Ex.2  Ex.3
Labeling consistency (auto. vs. benchmark)
  Initial labeling (Sect. 3)                  83%   88%   80%      86%   90%   93%
  Final labeling (Sect. 5)                    88%   91%   89%      93%   97%   95%
Segmentation consistency
  Fragments labeled automatically             90%   90%   92%      90%   90%   90%
  Fragments labeled manually                  92%   91%   94%      91%   95%   92%
  Bottom-up segmentation                      70%

Another type of experiment was aimed at verifying that the approach is
general and that the same algorithm applies well to different classes. This was
demonstrated using two additional classes: human faces and side view images of
cars. For these classes we did not evaluate the results using a manual benchmark,
but as can be seen in Fig. 6, our learning algorithm gives a similar level of
segmentation accuracy as obtained with manually labeled fragments. Examples
of the final segmentation results on the three classes are shown in Fig. 6. It is
interesting to note that shadows, which appeared in almost all the training class
images, were learned by the system as car parts.

The results demonstrate the relative merits of top-down and bottom-up seg- 
mentation. Using the top-down process, the objects are detected correctly as 
complete entities in all images, despite the high variability of the objects’ shapes
and the cluttered backgrounds. Boundaries are sometimes slightly distorted and small
features such as the ears may be missed. This is expected from pure top-down 
segmentation, especially when fragments are extracted from as few as 15 train- 
ing images. In contrast, bottom-up processes can detect region boundaries with 
higher accuracy compared with top-down processes, but face difficulty in group- 
ing together the relevant regions and identifying figure-ground boundaries - such 
as the boundaries of horse-heads, cars and human faces in our experiments. 

7 Discussion and Conclusions 

Our work demonstrates that it is possible to learn automatically how to segment
class-specific objects, giving good results for both the figure-ground labeling
of the image fragments themselves as well as the segmentation of novel class
images. The approach can be successfully applied to a variety of classes. In
contrast to previous class- and object-based approaches, our approach avoids the
need for manual segmentation and minimizes the need for other forms
of manual intervention. The initial input to the system is a training set of class
and non-class images. These are raw unsegmented images, each having only one
additional bit of information which indicates the image as class or non-class. The
system uses this input to construct automatically an internal representation for
the class that consists of image fragments representing shape primitives of the
class. Each fragment is automatically segmented by our algorithm into figure and
background parts. This representation can then be effectively used to segment
novel class images.

Fig. 6. Results. Rows 1-2, 4-5, 7-8 show figure-ground segmentation results, denoted by
the red contour. The results in rows 1, 4, 7 are obtained using the automatic figure-ground
labeling of the present method. The results in rows 2, 5, 8 are obtained using a
manual figure-ground labeling of the fragments. Rows 3, 6, 9 demonstrate the difficulties
faced in segmenting these images into their figure and background elements using a
bottom-up approach [9]: segments are represented by different colors.

The automatic labeling process relies on two main criteria: the degree of cover 
of fragment regions and the consistent edges within the fragments. Both rely on 
the high variability of background regions compared with the consistency of the
figure regions. We also evaluated another natural alternative criterion based on
a direct measure of variability: the variability of a region’s properties (such as
its grey level values) along the fragment’s hit samples. Experimental evaluation
showed that the degree of cover and border consistency were more reliable cri- 
teria for defining region variability - the main reason being that in some of the 
fragment hits, the figure part was also highly variable. This occurred in partic- 
ular when the figure part was highly textured. In such cases, fragments were 
detected primarily based on the contour separating the figure from background 
region, and the figure region was about as variable as the background region. It 
therefore proved advantageous to use the consistency of the separating boundary 
rather than that of the figure part. 

Another useful aspect is the use of inter-fragment consistency for iterative
refinement: the figure-ground segmentation of individual fragments is used to 
segment images, and the complete resulting segmentation is in turn used to 
improve the segmentation of the individual fragments. 

The figure-ground learning scheme combined bottom-up and top-down processes.
The bottom-up process was used to detect homogeneous fragment regions,
likely to share the same figure-ground label. The top-down process was used to
define the fragments and to determine for each fragment its degree of cover and
consistent edges likely to separate its figure part from its background part. This
combination of bottom-up and top-down processes could be further extended. In
particular, in the present scheme, segmentation of the training images is based
on the cover produced by the fragments. Incorporating similar bottom-up criteria
at this stage as well could improve object segmentation in the training images
and consequently improve the figure-ground labeling of fragments. As illustrated
in Fig. 6, the top-down process effectively identifies the figure region, and the
bottom-up process can be used to obtain more accurate object boundaries.

Abstract. We investigate the automated reconstruction of piecewise
smooth 3D curves, using subdivision curves as a simple but flexible curve 
representation. This representation allows tagging corners to model non- 
smooth features along otherwise smooth curves. We present a reversible 
jump Markov chain Monte Carlo approach which obtains an approxi- 
mate posterior distribution over the number of control points and tags. 
In a Rao-Blackwellization scheme, we integrate out the control point lo- 
cations, reducing the variance of the resulting sampler. We apply this 
general methodology to the reconstruction of piecewise smooth curves 
from multiple calibrated views, in which the object is segmented from the 
background using a Markov random field approach. Results are shown 
for multiple images of two pot shards as would be encountered in archae- 
ological applications. 



1 Introduction 

In this paper we investigate the reconstruction of piecewise smooth 3D curves 
from multiple calibrated views. Among other applications, this is useful for the 
reconstruction of shards and other artifacts that are known to have “jagged 
edges”. A motivating example of a broken pot-shard is shown in Figure 1. Such 
objects frequently show up in large museum collections and archaeological digs, 
and hence solving this problem would have important implications in preserving 
our cultural heritage. One possible application of our work is the automatic 
reconstruction of archaeological artifacts [1,2]. Apart from these applied uses, 
the problem of representing and performing inference in the space of piecewise 
smooth curves is of interest in its own right, and the methods developed here 
have potential application for other types of objects that have both continuous 
and discrete parameters that need to be optimized over. 

We focus on the reconstruction of curves rather than surfaces. Existing 3D
surface reconstruction methods rely on automatically extracted points and/or
lines [3]. In the case of textured objects or when using structured light, these
methods can be used successfully to densely sample from the 3D surface of an
object. However, these methods fail to use or capture the fact that an object like
a shard is delineated by a closed boundary curve. In this paper we complement
traditional 3D reconstruction methods by explicitly recovering these curves.

Fig. 1. Two automatically segmented pot-shard images. The boundaries derived from
an MRF-based segmentation algorithm are shown in white.

To model piecewise smooth curves we use tagged subdivision curves as the 
representation. This is inspired by the work of Hoppe [4], who successfully used
tagged subdivision surfaces for fitting piecewise smooth surfaces to 3D point 
clouds. The curve fitting literature includes use of algebraic curves [5,6], piecewise 
polynomials [7,8,9], point curves [10], and B-splines [11,12,13]. 

To the best of our knowledge, no prior work on fitting subdivision curves 
exists. Subdivision curves are simple to implement and provide a flexible way of 
representing curves of any type, including all kinds of B-splines and extending to 
functions without analytic representation [14]. In [4], Hoppe introduces piecewise
smooth subdivision surfaces, which model sharp features such as creases
and corners by tagging the corresponding control points. We apply the tagging
concept to subdivision curves to represent piecewise smooth curves. We presented
this idea earlier, with some preliminary results, in a workshop paper [15].
In this paper we significantly extend our approach to automatically determine
the number of control points needed for a good fit. Furthermore, we replaced
the manual segmentation process with an automatic MRF-based segmentation 
preprocessing step, obtaining a completely automated system. 

To infer the parameters of these curves from the data, we propose Rao- 
Blackwellized sampling. In our approach, Markov chain Monte Carlo (MCMC) 
sampling [16] is used to obtain a posterior distribution over the discrete variables, 
while the continuous control point locations are integrated out after a non-linear 
optimization step. We also sample over the number of control points, using the 
framework of reversible jump (RJ) MCMC that was introduced by Green [17] 
and later described in a more easily accessible way as trans-dimensional MCMC 
[18]. In related work, Denison and Mallick [7,8] propose fitting piecewise poly- 
nomials with an unknown number of knots using RJMCMC sampling. Punskaya 
[9] extends this work to unknown models within each segment with applications 
in signal segmentation. DiMatteo [13] extends Denison’s work for the special 
case of natural cubic B-splines, handling non-smooth curves by representing a
corner with multiple knots. However, multiple knots cannot be at the same loca-
tion, and therefore only approximate the corner. With our method, corners can 
be represented exactly with a single control point. In addition, we are working 
with a much reduced sample space, as we directly solve for optimal control point 
locations and hence only sample over the boolean product space of corner tags. 

We apply this general methodology to 3D reconstruction of piecewise smooth 
curves from multiple calibrated images. While much of the curve fitting literature 
is concerned with 1D curves for signal processing [8,7,13,9], in computer vision
it is more common to fit curves in 2D or 3D. For example, 2D curves are often fit 
to scattered point data [6] and image contours [11]. For the special case of stereo 
cameras, [12] describes reconstruction and tracking of 3D curves represented by 
2D B-spline curves that are either coupled through epipolar geometry constraints 
or are coupled to a canonical frame model through affine transformations. More 
general multiple view curve reconstruction methods are described in [5] using 
algebraic curves and in [10] for point curves with uncalibrated cameras. 







Fig. 2. The subdivision process: (a) initial control points, linearly interpolated; (b) the 
original mesh, subdivided by introducing average points; (c) the result after application 
of the averaging mask to the subdivided control points; (d) converged curve. 



2 Subdivision Curves 



Here we briefly review subdivision curves. See [14] for more details. A subdivision
curve is defined by repeatedly refining a vector of control points
x^t = (x_0^t, x_1^t, \ldots, x_{n 2^t - 1}^t), where n is the initial number of control points and t
the number of subdivisions performed. This refinement process can be separated
into two steps as shown in Figure 2: a splitting step that introduces midpoints:

\tilde{x}_{2i}^{t+1} = x_i^t, \qquad \tilde{x}_{2i+1}^{t+1} = \frac{1}{2} (x_i^t + x_{i+1}^t)

and an averaging step that computes weighted averages:

x_i^{t+1} = \sum_k r_k \, \tilde{x}_{i+k}^{t+1}

The type of the resulting curve depends on the averaging mask r. Subdivision
can be used to create a wide range of functions [14], including uniform and
non-uniform B-splines, and functions that have no analytic representation like
Daubechies wavelets. For example, the mask for a cubic B-spline is \frac{1}{4}(1, 2, 1).
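
One round of splitting and averaging can be sketched in Python as follows. This is a minimal sketch assuming a closed control polygon (indices wrap around), which suits closed boundary curves. The function name subdivide is hypothetical, and NumPy is used for illustration only.

```python
import numpy as np

def subdivide(points, mask=(0.25, 0.5, 0.25)):
    """One round of splitting and averaging on a closed control polygon."""
    pts = np.asarray(points, dtype=float)
    split = np.empty((2 * len(pts),) + pts.shape[1:])
    split[0::2] = pts                                       # keep the old points
    split[1::2] = 0.5 * (pts + np.roll(pts, -1, axis=0))    # insert midpoints
    r_prev, r_mid, r_next = mask                            # cubic B-spline by default
    return (r_prev * np.roll(split, 1, axis=0) + r_mid * split
            + r_next * np.roll(split, -1, axis=0))

square = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
refined = subdivide(square)        # 8 points after one round
```

With the default mask, an old point x_i moves to (x_{i-1} + 6 x_i + x_{i+1}) / 8 and a new point lands on the midpoint of its two parents, which is the familiar cubic B-spline refinement rule.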

As explained in [14], the splitting and averaging steps can be combined into
multiplication of the local control points with a local subdivision matrix L, e.g.

L = \frac{1}{8} \begin{pmatrix} 4 & 4 & 0 \\ 1 & 6 & 1 \\ 0 & 4 & 4 \end{pmatrix}

for cubic B-splines. Repeated application of this matrix to a control point and
its immediate neighbors results in a sequence of increasingly refined points that
converges to the limit value of the center point. Eigenvector analysis on the
matrix L leads to an evaluation mask u that can be applied to a control point
and its neighbors, resulting in the limit position for that control point. For
example, the evaluation mask for cubic B-splines is

u = \frac{1}{6} (1, 4, 1)

The curve can be refined to the desired resolution before this mask is applied.

It is convenient for the exposition below to view the entire subdivision process
as a large matrix multiplication

C = S\Theta    (1)

where C is the final curve, the n x 1 vector Θ represents the control
points/polygon, and the subdivision matrix S combines all m subdivision steps
and the application of the evaluation mask into one n2^m x n matrix. This can be
done as both subdivision and evaluation steps are linear operations on the control
points. The final curve C is an n2^m x 1 vector that is obtained by multiplying
S with the control point vector Θ as in (1).

The derivative of the final curve C with respect to a change in the control
points Θ, needed below to optimize over them, is simply a constant matrix:

\frac{\partial C}{\partial \Theta} = \frac{\partial (S\Theta)}{\partial \Theta} = S    (2)



While the above holds for 1D functions only, an n-dimensional subdivision curve
can easily be defined by using n-dimensional control points, effectively representing
each coordinate by a 1D subdivision curve. Our implementation is done
in the functional language ML and uses functors to remain independent of the 
dimensionality of the underlying space. Note that in this case, the derivative 
equation (2) holds for each dimension separately. 
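
The matrix view (1) can be illustrated with a short NumPy sketch. It is not the authors' ML implementation. It assumes a closed cubic B-spline curve, and the helper names round_matrix, evaluation_matrix and subdivision_matrix are hypothetical.

```python
import numpy as np

def round_matrix(n):
    """2n x n matrix of one cubic B-spline subdivision round (closed curve)."""
    M = np.zeros((2 * n, n))
    for i in range(n):
        for offset, w in ((-1, 1 / 8), (0, 6 / 8), (1, 1 / 8)):
            M[2 * i, (i + offset) % n] += w       # old point, repositioned
        M[2 * i + 1, i] += 0.5                    # new point between i and i+1
        M[2 * i + 1, (i + 1) % n] += 0.5
    return M

def evaluation_matrix(n):
    """n x n circulant matrix applying the evaluation mask (1, 4, 1) / 6."""
    E = np.zeros((n, n))
    for i in range(n):
        for offset, w in ((-1, 1 / 6), (0, 4 / 6), (1, 1 / 6)):
            E[i, (i + offset) % n] += w
    return E

def subdivision_matrix(n, m):
    """S of Eq. (1): m subdivision rounds followed by the evaluation mask."""
    S = np.eye(n)
    for level in range(m):
        S = round_matrix(n * 2 ** level) @ S
    return evaluation_matrix(n * 2 ** m) @ S

S = subdivision_matrix(4, 3)                      # 32 x 4
theta = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.5],
                  [1.0, 1.0, 1.0], [0.0, 1.0, 0.5]])
C = S @ theta                                     # each coordinate handled separately
```

Because C depends linearly on the control points, the Jacobian in (2) is S itself and can be computed once and reused for every coordinate.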



3 Piecewise Smooth Subdivision Curves 

There are a number of ways to represent sharp corners in otherwise smooth
curves. One solution is to place multiple control points at the location of a
corner, but consecutive subdivision creates superfluous points at this location.
Furthermore, it places unwanted constraints on the adjacent curve segments. A
commonly used method in connection with B-splines is extra knot insertion.

Fig. 3. A tagged 2D subdivision curve. Tagged points are drawn as circles: (a) the
original control points, (b) the converged curve with non-smoothly interpolated points.

We employ a more general method here based on work of Hoppe [4] for subdivision
surfaces. It allows us to “tag” control points, so that different averaging
masks can be used at these points. For example, using the mask (0, 1, 0) forces the
interpolation of the control point and introduces a discontinuity in the
derivative at this point, while retaining the smooth curve properties for all other points of the
curve (Figure 3). The number of tagged control points does not increase during
the subdivision process, and so the non-smoothness of the curve is restricted to
the tagged control points, which will always be interpolated.
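
A tagged subdivision round can be sketched as below. This is an illustration under the assumptions of a closed curve and a cubic B-spline mask at untagged points. The name subdivide_tagged is hypothetical, and points are expected as an (n, d) array.

```python
import numpy as np

def subdivide_tagged(points, tags):
    """One subdivision round that uses the mask (0, 1, 0) at tagged points."""
    pts = np.asarray(points, dtype=float)         # shape (n, d)
    tags = np.asarray(tags, dtype=bool)
    n = len(pts)
    split = np.empty((2 * n,) + pts.shape[1:])
    split[0::2] = pts
    split[1::2] = 0.5 * (pts + np.roll(pts, -1, axis=0))
    new_tags = np.zeros(2 * n, dtype=bool)
    new_tags[0::2] = tags                         # inserted midpoints are never tagged
    smooth = (0.25 * np.roll(split, 1, axis=0) + 0.5 * split
              + 0.25 * np.roll(split, -1, axis=0))
    # Tagged points keep their position; all others are averaged.
    return np.where(new_tags[:, None], split, smooth), new_tags

pts = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
tags = [True, False, False, False]                # a single corner at the first point
for _ in range(3):
    pts, tags = subdivide_tagged(pts, tags)
```

After three rounds the tagged corner is still at its original location and remains the only tagged point, while the untagged corners have been smoothed away.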

Below we will use the following notation to describe tagged 3D subdivision
curves. The locations of the 3D control points are given by Θ = {x_0^0, ..., x_{n-1}^0},
where n is the number of control points. For each original control point x_i^0, a
boolean tag b_i indicates whether it is non-smoothly interpolated, i.e. whether
there is a “corner” at control point i. The collection of tags b_i is written as
T = {b_0, ..., b_{n-1}}.

4 Rao-Blackwellized Curve Fitting 

In this section we describe how the parameters of a tagged subdivision curve 
can be estimated from noisy measurements Z, irrespective of how those mea- 
surements were obtained. In Section 5 this will be specialized to the problem of 
fitting from multiple, calibrated segmentations of an object. 

Because the measurements are noisy, we take a probabilistic approach. Of 
interest is the posterior distribution 

P(n, \Theta_n, T_n | Z) \propto P(Z | n, \Theta_n, T_n) \, P(\Theta_n | n, T_n) \, P(T_n | n) \, P(n)    (3)

over the possible number of control points n, the control point values Θ_n and the
tag variables T_n. Here the control points Θ_n ∈ R^{3n} are continuous and the tags
T_n ∈ {0, 1}^n are discrete. The likelihood P(Z | n, Θ_n, T_n) and control polygon prior
P(Θ_n | n, T_n) are application specific and will be specified in Section 5. Choosing
the complexity prior P(n) to be uniform over a range of valid values allows us
to find an unbiased distribution over the number of control points. Similarly,
we use an uninformative tagging prior P(T_n | n), but a binomial distribution
P(T_n | n) ∝ p^c (1 - p)^{n-c} over the number of active tags c could also be used.



4.1 Trans-dimensional MCMC 



Since the number of possible tag configurations is 2^n and hence exponential
in n, we propose to use reversible jump Markov chain Monte Carlo (MCMC)
sampling [18,17] to perform approximate inference. MCMC methods produce an
approximate sample from a target distribution π(X), by simulating a Markov
chain whose equilibrium distribution is π(X). The algorithm starts from a random
initial state and proposes probabilistically generated moves in the
state space, which is equivalent to running a Markov chain. The specific MCMC
algorithm we use is the trans-dimensional MCMC algorithm from [18]:

1. Start with a random initial state X^(0).

2. Propose a move type m ∈ M with probability j(m).

3. Generate a random sample u from the move-specific proposal density g_m.
The move type m and random sample u determine how to move from the
current state X^(r) to the proposed state X'.

4. Calculate the corresponding reverse move (m', u').

5. Compute the acceptance ratio

a = \frac{\pi(X')}{\pi(X^{(r)})} \, \frac{j(m')}{j(m)} \, \frac{g_{m'}(u')}{g_m(u)} \, \left| \frac{\partial (x', u')}{\partial (x, u)} \right|    (4)

where the Jacobian factor corrects for the change in variables (see below).

6. Accept X^(r+1) ← X' with probability min(a, 1), otherwise X^(r+1) ← X^(r).



The generated sequence of states $\{X^{(r)}\}$ will be a sample from $\pi(X)$ if
the sampler is run sufficiently long, and one discards the samples in the
initial “burn-in” period of the sampler to avoid dependence on the chosen start
state.

For fitting tagged subdivision curves, one possible set of move types consists
of “Up” and “Down” for inserting and deleting control points and “Modify”
for flipping one or more of the tags. Care has to be taken to ensure that the
move from $(x, u)$ to $(x', u')$ is reversible and therefore a diffeomorphism.
One requirement is that the dimensions on both sides have to match. Note that
in our case the Jacobian of the diffeomorphism
$\left| \frac{\partial(x', u')}{\partial(x, u)} \right|$ is always 1 because we
integrate out the continuous part of the space (see below).
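The sampler steps and the “Up”, “Down” and “Modify” moves can be sketched on a purely discrete state as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the state is only the tuple of tags (its length plays the role of $n$), the callback `log_target`, the move probabilities, `flip_prob` and the rule that an inserted point receives a uniformly random tag are all illustrative choices. Because the state is discrete, the Jacobian factor is 1 and is omitted.

```python
import math
import random

def sample_tags(log_target, init_tags, n_min, n_max, iters,
                flip_prob=0.2, seed=0):
    """Trans-dimensional Metropolis-Hastings over tag tuples of varying length.

    log_target(tags) is a hypothetical callback returning the unnormalized
    log of the target pi(X) for a tuple of 0/1 tags.
    """
    rng = random.Random(seed)
    j = {"up": 0.25, "down": 0.25, "modify": 0.5}  # move probabilities j(m)
    reverse = {"up": "down", "down": "up", "modify": "modify"}
    state = tuple(init_tags)
    samples = []
    for _ in range(iters):
        m = rng.choices(list(j), weights=list(j.values()))[0]
        n = len(state)
        proposal = None
        log_g = 0.0  # log g_{m'}(u') - log g_m(u)
        if m == "up" and n < n_max:
            i, b = rng.randrange(n + 1), rng.randrange(2)
            proposal = state[:i] + (b,) + state[i:]
            log_g = math.log(2.0)   # forward 1/(2(n+1)), reverse 1/(n+1)
        elif m == "down" and n > n_min:
            i = rng.randrange(n)
            proposal = state[:i] + state[i + 1:]
            log_g = -math.log(2.0)  # forward 1/n, reverse 1/(2n)
        elif m == "modify":
            # Flip each tag independently; this proposal is symmetric.
            proposal = tuple(t ^ int(rng.random() < flip_prob) for t in state)
        if proposal is not None:
            log_a = (log_target(proposal) - log_target(state)
                     + math.log(j[reverse[m]] / j[m]) + log_g)
            if log_a >= 0.0 or rng.random() < math.exp(log_a):
                state = proposal
        # A rejected or impossible move repeats the current state.
        samples.append(state)
    return samples
```

For example, the toy target `lambda tags: -len(tags) * math.log(2.0)` gives every length between `n_min` and `n_max` the same total probability, so the sampled lengths should be roughly uniform.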



4.2 Rao-Blackwellization 

Sampling over the joint discrete-continuous space is expensive. A crucial element
of our approach is to not sample from the joint posterior (3) but rather from the
marginal distribution $P(n, T_n \mid Z)$ over the number of points $n$ and the
tags $T_n$:

$$\pi(X) \triangleq P(n, T_n \mid Z) = \int P(n, \Theta_n, T_n \mid Z)\, d\Theta_n \tag{5}$$

while performing the integration (5) above analytically.

From a sample over $n$ and $T_n$ of size $N$ we can obtain a high quality
approximation to the joint posterior as follows

$$P(n, \Theta_n, T_n \mid Z) \approx \sum_{r=1}^{N} P(\Theta_n \mid Z, (n, T_n)^{(r)})\, \delta\big((n, T_n), (n, T_n)^{(r)}\big) \tag{6}$$

with $\delta(\cdot, \cdot)$ being the Kronecker delta. Thus, (6) approximates the
joint posterior $P(n, \Theta_n, T_n \mid Z)$ as a combination of discrete samples
and continuous conditional densities $P(\Theta_n \mid Z, (n, T_n)^{(r)})$.

Integrating out the continuous part of the state space reduces the number 
of samples needed. The superiority of (6) over a density estimate based on joint 
samples is rooted in an application of the Rao-Blackwell theorem, which is why 
this technique is often referred to as Rao-Blackwellization [19,20,21]. Intuitively, 
the variance of (6) is lower because it uses exact conditional densities to approx- 
imate the continuous part of the state. As such, far fewer samples are needed to 
obtain a density estimate of similar quality. 

Substituting the factorization of the posterior (3) in (5) we obtain

$$P(n, T_n \mid Z) = k\, P(n)\, P(T_n \mid n) \int P(Z \mid n, \Theta_n, T_n)\, P(\Theta_n \mid n, T_n)\, d\Theta_n$$

Assuming the conditional posterior
$P(Z \mid n, \Theta_n, T_n)\, P(\Theta_n \mid n, T_n)$ is approximately
normally distributed around the MAP estimate of the control points $\Theta_n^*$

$$P(\Theta_n \mid Z, n, T_n) \approx \frac{1}{\sqrt{|2\pi\Sigma|}}\, e^{-\frac{1}{2} \|\Theta_n - \Theta_n^*\|_{\Sigma}^{2}}$$

the integral can be approximated via Laplace’s method, and we obtain the
following target distribution over the number of control points $n$ and the
tags $T_n$:

$$P(n, T_n \mid Z) = k\, P(n)\, P(T_n \mid n)\, \sqrt{|2\pi\Sigma|}\, P(Z \mid n, \Theta_n^*, T_n)\, P(\Theta_n^* \mid n, T_n)$$

The MAP estimate $\Theta_n^*$ can be found by non-linear optimization (see below).
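Laplace's method as used above can be illustrated in one dimension: the integral of a function peaked at $\theta^*$ is approximated by its value at the mode times $\sqrt{2\pi\sigma^2}$, where $\sigma^2$ is the negative inverse curvature of the log-integrand at the mode. The sketch below is an assumption-laden toy (scalar parameter, curvature by finite differences, hypothetical name `laplace_integral`) and not the multi-dimensional computation of the paper.

```python
import math

def laplace_integral(log_f, theta_star, h=1e-3):
    """Laplace approximation to the integral of exp(log_f) over a scalar.

    log_f:      log of the non-negative integrand.
    theta_star: location of its mode (the MAP estimate in the text).
    h:          step for the central finite-difference curvature estimate.
    """
    # Second derivative of log_f at the mode (negative at a maximum).
    d2 = (log_f(theta_star + h) - 2.0 * log_f(theta_star)
          + log_f(theta_star - h)) / (h * h)
    sigma2 = -1.0 / d2  # variance of the fitted Gaussian
    # Value at the mode times sqrt(|2 pi Sigma|), here a scalar.
    return math.exp(log_f(theta_star)) * math.sqrt(2.0 * math.pi * sigma2)
```

When the integrand is exactly Gaussian in shape, the approximation is exact up to rounding, which gives a convenient sanity check.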



5 Multiple View Fitting 

In this section we specialize the general methodology of Section 4 to reconstruct- 
ing tagged 3D subdivision curves from multiple 2D views of a “jagged” object. 
We assume here that (a) the images are calibrated, and (b) the measurements Z 
are the object boundaries in the 2D images, i.e. the object has been segmented 
out from the background. For the results presented below, the object boundaries 
are segmented automatically using a Markov random field approach.

To calculate an objective function, the 3D curve given by $\Theta_n$ and $T_n$ is
subdivided $m$ times to the desired resolution, evaluated, and the resulting points
$\{p_i\}$ are projected into each view. We then use the following form for the
likelihood:

$$P(Z \mid n, \Theta_n, T_n) \propto e^{-\frac{1}{2\sigma^2} E(Z, n, \Theta_n, T_n)}$$

where $\sigma^2$ is the variance of the noise in the measurements. The error
function $E$ above is obtained as a sum of squared errors

$$E(Z, n, \Theta_n, T_n) = \sum_{c=1}^{C} \sum_{i=1}^{n 2^m} \Delta(\Pi_c(p_i), Z_c)^2 \tag{7}$$

with one term for each of the $n 2^m$ final subdivision curve points in each of
the $C$ images, explained in more detail below. The prior $P(n)$ on the number
of control points, the prior $P(T_n \mid n)$ on the tag configuration and the
conditional control polygon prior are taken to be uniform in all the results
reported below.

Each error term $\Delta(\Pi_c(p_i), Z_c)$ in (7) determines the distance from the
projection $\Pi_c(p_i)$ of a point $p_i$ into view $c$ to the nearest point on the
object boundary $Z_c$. In order to speed up this common calculation, we
pre-calculate a lookup table for $\Delta$ by means of the well known distance
transform to obtain a Chamfer image [22]. Each pixel in a Chamfer image contains
the distance from this pixel to the nearest point on the segmented curve $Z_c$ in
view $c$. Calculating the Chamfer images has to be done only once and runs in
linear time. In this way, we trade memory usage for computational speed.
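The lookup-table idea can be sketched with a classic two-pass chamfer distance transform followed by an evaluation of the error (7). This is only an illustrative sketch: the weights (1 for axial steps, $\sqrt{2}$ for diagonal steps), the nearest-pixel lookup, and the names `chamfer_image` and `squared_error` are assumptions; the paper does not state which distance transform variant or interpolation was used.

```python
import math

def chamfer_image(boundary, a=1.0, b=math.sqrt(2.0)):
    """Approximate distance from each pixel to the nearest boundary pixel.

    boundary: 2D list of 0/1 values, 1 marking pixels on the curve Z_c.
    a, b:     assumed costs of axial and diagonal steps.
    """
    h, w = len(boundary), len(boundary[0])
    inf = float("inf")
    d = [[0.0 if boundary[y][x] else inf for x in range(w)] for y in range(h)]
    # Forward pass: propagate from the top-left neighbours.
    for y in range(h):
        for x in range(w):
            if x > 0:
                d[y][x] = min(d[y][x], d[y][x - 1] + a)
            if y > 0:
                d[y][x] = min(d[y][x], d[y - 1][x] + a)
                if x > 0:
                    d[y][x] = min(d[y][x], d[y - 1][x - 1] + b)
                if x < w - 1:
                    d[y][x] = min(d[y][x], d[y - 1][x + 1] + b)
    # Backward pass: propagate from the bottom-right neighbours.
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            if x < w - 1:
                d[y][x] = min(d[y][x], d[y][x + 1] + a)
            if y < h - 1:
                d[y][x] = min(d[y][x], d[y + 1][x] + a)
                if x < w - 1:
                    d[y][x] = min(d[y][x], d[y + 1][x + 1] + b)
                if x > 0:
                    d[y][x] = min(d[y][x], d[y + 1][x - 1] + b)
    return d

def squared_error(points_per_view, chamfer_per_view):
    """Sum of squared lookup distances over all views and projected points."""
    total = 0.0
    for points, chamfer in zip(points_per_view, chamfer_per_view):
        for x, y in points:
            total += chamfer[int(round(y))][int(round(x))] ** 2
    return total
```

Each pixel is visited twice regardless of the curve, which is where the linear running time comes from; after that, every error term costs a single table lookup.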

The outlines of the shards were automatically segmented as foreground and 
background classes using a Markov random field (MRF) approach. To offset the 
analytical intractability associated with MRFs, we employed Gibbs sampling to
approximate the posterior probability $P(Z \mid I)$ of the outlines $Z$ given the input
images $I$. The sampling is initialized by selecting the class with highest likelihood
for each pixel. Consequently, Gibbs sampling requires only a few samples to 
achieve accurate segmentations. 

The reprojection of the 3D curve in the images is done in the standard way [3].
We assume the cameras are described using a single radial distortion parameter
$\kappa$, focal lengths $f_x$ and $f_y$, principal point $(p_x, p_y)$, and skew $s$.
The pose of a camera is given by a rotation $R$ and translation $t$. The
projection of a 3D point $X$ into an image is then given by

$$\Pi(X) = D(K[R \mid t], \kappa, X)$$

where $K$ is the $3 \times 3$ calibration matrix

$$K = \begin{pmatrix} f_x & s & p_x \\ 0 & f_y & p_y \\ 0 & 0 & 1 \end{pmatrix}$$

and $D$ is a function that models the radial distortion.

To minimize (7) given a specific tag configuration $T$, we use
Levenberg-Marquardt non-linear optimization. To obtain the derivative of the
error function $E$ with respect to the original control points $\Theta_0$, we
apply the chain rule to combine the derivatives of the camera projections and
the derivative $S$ from equation 2 on page 332 of the subdivision curve with
respect to the original control points. Implementing the chain rule involves a
pointwise multiplication of the projected curve derivatives with the Chamfer
image gradients, which are estimated by convolution with a Gaussian derivative
mask.
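The camera model of this section (pose $R$, $t$, calibration matrix $K$ and a single radial distortion parameter) can be sketched as below. The text does not specify the form of the distortion function, so the sketch assumes the common one-parameter model that scales normalized image coordinates by $1 + \kappa r^2$ before applying $K$; the function name `project` and its argument layout are likewise illustrative.

```python
def project(X, R, t, fx, fy, px, py, s=0.0, kappa=0.0):
    """Project a 3D point into pixel coordinates.

    X:      3D point as a sequence of three floats.
    R, t:   3x3 rotation (nested lists) and translation of the camera pose.
    fx, fy: focal lengths; (px, py): principal point; s: skew.
    kappa:  assumed single radial distortion coefficient.
    """
    # Transform into the camera frame: Xc = R X + t.
    xc = [sum(R[i][k] * X[k] for k in range(3)) + t[i] for i in range(3)]
    # Perspective division to normalized image coordinates.
    xn, yn = xc[0] / xc[2], xc[1] / xc[2]
    # Assumed radial distortion: scale by 1 + kappa * r^2.
    r2 = xn * xn + yn * yn
    scale = 1.0 + kappa * r2
    xd, yd = xn * scale, yn * scale
    # Apply the calibration matrix K.
    u = fx * xd + s * yd + px
    v = fy * yd + py
    return u, v
```

With `kappa = 0` and `s = 0` this reduces to the plain pinhole projection, which is a useful check when calibrating against a known pattern.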

6 Results 

We illustrate our approach on two sets of pot-shard images, shown in Figure 
4. The shards were placed on a calibration pattern, which allowed us to easily 
optimize for the camera calibration parameters. Six images for the first pot shard 
and four images for the second were used for curve fitting, all of which are taken 
from about the same angle from different sides of the shard. Two of those views 
are shown in columns (a) and (b) of Figure 4. The images in column (c) were 
not used for fitting and serve as verification views. They are informative, since 
they are taken at a much lower angle than all other images. The shard images 
were automatically segmented by an MRF approach, as described in Section 5. 
Figure 1 shows two such segmentations overlaid on the corresponding shard
images.

The fitting process starts with a circular configuration of a small number of 
control points in a plane parallel to the ground plane, as shown in the first row 
of Figure 4. Each proposal for the Markov Chain consists of a random move as 
described in Section 4.1, followed by non-linear optimization of the control point 
locations. The move can either introduce a new control point, delete an existing 
one, or change the tag configuration by inverting each tag with probability K 
For evaluation of the error function, the curve is subdivided four times before 
the limit positions of the points are calculated using the evaluation mask. 

After only five iterations, as shown in the second row of Figure 4, the
subdivision curve has adapted fairly well to the boundary, but the number of
control points is still too small and the tagging is wrong. After 250
iterations, both the number of control points and the tagging configuration
have adapted to fit the shard boundary well. Note that we sample from the
posterior distribution, so the number of control points and the tagging
configuration never converge to a single value.

The algorithm finds a suitable number of control points independent of the
initial number, as can be seen in Figure 5(a), where this number is plotted
over time. The algorithm converges quickly to an “optimal” number that is
influenced by the curve itself and the variance of the measurements.

The diagram in Figure 5(a) clearly shows the burn-in phase of the Markov
chain. In what follows, the posterior distribution is determined by discarding
the first 500 of the 1500 samples to make the result independent of the
initialization. 1500 samples seem sufficient: an evaluation of 9000 samples did
not show a significant change in results.

Fig. 4. Results are shown for two pot shards. Projections of the control points are
drawn as yellow '+' for untagged and yellow 'x' for tagged ones, and the corresponding
subdivision curve is drawn in white. Six views are used for the fitting process of the
first shard, while only four are used for the second shard. In both cases, two of those
are shown in columns (a) and (b). The third column (c) shows a view that is not used
for fitting and is taken from a lower angle than all other images.

The first row shows the projections of the initial control points and the corresponding
3D subdivision curve on images of the first shard. The second row shows results after
five iterations (error: $9.8 \cdot 10^2$). The third row is sample number 250
(error: $4.3 \cdot 10^2$). The last row is the result for the second shard after
250 samples.

Fig. 5. (a) Number of control points during the burn-in phase. The lower curve starts
with six control points, the upper one with 14. (b) Probability over different numbers
of control points after burn-in phase. [Plots not reproduced; axis label: number of
control points.]



One example of a marginal distribution that can be approximated from these 
samples is the number of control points that are needed to fit the curve well, as 
shown in Figure 5(b). 
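Estimating such a marginal from the chain amounts to discarding the burn-in samples and counting. A minimal sketch, assuming each sample is stored as a tuple of tags whose length is the number of control points (the storage format and the name `marginal_over_n` are illustrative assumptions):

```python
from collections import Counter

def marginal_over_n(samples, burn_in):
    """Empirical distribution over the number of control points.

    samples: list of tag tuples, one per MCMC iteration.
    burn_in: number of initial samples to discard.
    """
    kept = samples[burn_in:]
    counts = Counter(len(tags) for tags in kept)
    total = len(kept)
    return {n: c / total for n, c in sorted(counts.items())}
```

With the settings reported above, this would be called with `burn_in=500` on the 1500 stored samples.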

7 Conclusion 

We investigated modeling piecewise smooth curves with tagged subdivision 
curves that provide a simple and flexible representation. The parameters of these 
curves were determined by a Rao-Blackwellized sampler, in which the optimal 
locations of the control points were integrated out and determined by non-linear 
optimization, and only the distribution over the number of control points and 
the tag configurations was sampled over. 

This method was successfully applied to 3D curve reconstruction from multiple
images, as illustrated in the results section. These results were obtained in
an automated fashion, using an MRF-based approach to segment the images with a
known calibration background. Then, starting from an initial circular
distribution of the control points, the algorithm approximated the object
boundary well within a small number of sampling steps.

It would be of interest to more closely examine the quality of the Gaussian 
assumption made in the Rao-Blackwellization step. One way to validate this 
assumption is by MCMC sampling over the control point locations for a given 
tag configuration. Also, on the image processing side, there are some problems 
with double shard boundaries and occluding contours that need to be resolved 
in order to create a robust, automated system. 

We would like to connect our 3D curve reconstruction methods with a com- 
plete “broken-pot” reconstruction such as described in [2] . Comparing features of 
boundaries from different shards could be used for reconstruction of archaeolog- 
ical artifacts, possibly in connection with other features, like texture and surface 
curvature, as suggested in [1]. Finally, it is our hope that the general
methodology described here can be successfully applied in other
discrete-continuous reconstruction settings.

Acknowledgments. We would like to thank Peter Presti from IMTC for pro-

Abstract. In dimly lit conditions, it is difficult to take a satisfactory
image with a long exposure time using a hand-held camera. Even when a tripod
is used, moving objects in the scene still generate ghosting and blurring
effects. In this paper, we propose a novel approach that recovers a
high-quality image from only two defective input images by exploiting the
tradeoff between exposure time and motion blur, considering color statistics
and spatial constraints simultaneously. A Bayesian framework is adopted to
incorporate these factors and generate an optimal color mapping function. No
estimation of the PSF is performed. Our approach can be readily extended to
handle high contrast scenes and reveal fine details in saturated or highlight
regions. An image acquisition system deploying off-the-shelf digital cameras
and camera control software was built. We present our results on a variety of
defective images: global and local motion blur due to camera shake or object
movement, and saturation due to high contrast scenes.



1 Introduction 

Taking satisfactory photos under weak lighting conditions using a hand-held 
camera is very difficult. In this paper, we propose a two-image approach to 
address the image recovery problem by performing intensity correction. In order 
to exploit the tradeoff between the exposure time and the blurring degree of the 
captured images, we take the two input images using the same camera with the 
following exposure settings: 

— One image $I_L$ is taken with an exposure time around the safe shutter speed¹,
producing an under-exposed image in which motion blur is largely reduced.
Since it is too dark, the colors in the image are not acceptable (Fig. 1(a)).

* This work is supported by the Research Grants Council of the Hong Kong Special
Administrative Region, China: HKUST6193/02E.

¹ In photography, the safe shutter speed is assumed to be not slower than the reciprocal
of the focal length of the lens, in units of seconds [1]. The longer the exposure
time, the blurrier the image becomes.

T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3023, pp. 342-354, 2004.

Fig. 1. We take two successive images with different exposure intervals to construct
the high-quality image. (a) $I_L$. (b) $I_H$. (c) $I_C$.

— The other image $I_H$ is a normal image acquired under an extended exposure
time. The color and brightness of this image are acceptable. However, it is
motion blurred because of camera shake or moving objects in the scene
(Fig. 1(b)).

The images can be taken by a hand-held camera, possibly in a dimly lit
condition. Combining these two defective images $I_L$ and $I_H$, our method
automatically generates a clear and crisp image $I_C$, as shown in Fig. 1(c).

There are several related techniques for recovering images when the exposure
time is longer than the safe shutter speed. They can be roughly classified
into in-process and post-process approaches, which eliminate motion blur due to
long exposure and camera shake. In-process approaches are mainly
hardware-based techniques, where lens stabilization is achieved by camera shake
compensation [8,9]. Alternatively, CMOS cameras can perform high-speed frame
captures within the normal exposure time, which allows for multiple image-based
motion blur restoration [11]. These methods are able to produce clear and crisp
images, given a reasonable exposure time. However, they require specially
designed hardware devices.

On the other hand, post-process methods are mostly motion deblurring
techniques. Among them, blind deconvolution is widely adopted to enhance a
single blurred image, under different assumptions on the PSF [6,15,10,14].
Alternatively, several images with different blurring directions [12] or an
image sequence [2] are used, in more general situations, to estimate the PSF.
In both cases, due to the discretization and quantization of images in both
spatial and temporal coordinates, the PSF cannot be reliably estimated, which
produces a result inferior to the ground truth image, if available (an image
either taken with a camera on a tripod, or of a static scene).

Ezra and Nayar [3] proposed a hybrid imaging system consisting of a primary 
(high spatial resolution) detector and a secondary (high temporal resolution) 
detector. The secondary detector provides more accurate motion information to 
estimate the PSF, thus making deblurring possible even under long exposure. 
However, the method needs additional hardware support, and the deblurred 
image can still be distinguishable from the ground truth image. 

Because of the weaknesses of deblurring methods, we do not directly perform
deblurring on $I_H$. Instead, an image color correction approach is adopted. By
incorporating the color statistics and the spatial structures of $I_H$ and $I_L$,
we propose a Bayesian framework and maximize the a posteriori probability (MAP)
of the color mapping function $f(\cdot)$ from $I_L$ to $I_H$ in the color space,
so that the under-exposed $I_L$ is enhanced to a normally exposed image $I_C$.

Our method can deal with camera shake and object movement at the same
time, in a unified framework. Moreover, changes of object topology or object
deformation can also be naturally handled, which is difficult for most deblurring
methods, since different parts of the object have different PSFs. Besides, by
slightly modifying one constraint, our method can be extended to deal with high
contrast scenes and automatically produce images that capture fine details in
highlight or saturated areas.

The rest of this paper is organized as follows: we describe our image
acquisition system in Section 2. Section 3 defines the relationship between
$I_L$ and $I_H$. In Section 4, we state and define our problem, propose our
probabilistic model, and infer the color mapping function in the Bayesian
framework. Section 5 presents our results. Finally, we conclude our paper in
Section 6.



2 Image Acquisition 

To correctly relate the two images, we require that $I_L$ be taken almost
immediately after $I_H$ is taken. Keeping the time lapse as short as possible
minimizes the difference between the two images and maximizes the regional
match of the pixel positions, as illustrated in Fig. 2(a). In other words,
the under-exposed image $I_L$ can be regarded as a sensing component in the
normally exposed image $I_H$ in the temporal coordinates. This requirement makes
it possible to reasonably model the camera movement during the exposure time,
and to constrain the mapping process.

Our image acquisition system and its configuration are shown in Fig. 2(b). The
digital camera is connected to the computer. The two successive exposures with
different shutter speeds are controlled by the corresponding camera software.
This setup frees the photographer from manually changing the camera parameters
between shots, so that s/he can focus on shooting the best pictures.

Fig. 2. (a) Two successive exposures guarantee that the centers of the images do
not vary by too much. (b) The configuration of our camera system.

A similar functionality, called Exposure Bracketing, has already been built
into many digital cameras, e.g., Canon G-model and some Nikon Coolpix model
digital cameras. With a single press of the shutter, two or three successive
images are taken with different shutter speeds under the same configuration.
However, using the built-in camera functionality has some limitations: it does
not operate in manual mode, and the difference between shutter speeds is limited.

In the next section, we analyze the relationship between $I_L$ and $I_H$, and
propose the constraints that relate these two images.

3 Relationship between $I_L$ and $I_H$

$I_L$ and $I_H$ are two images of the same scene taken successively with different
exposures. Therefore, they are related not only by the color statistics, but also by
the corresponding spatial coherence. In this section, we describe their
relationship, which is translated into constraints for inferring a color mapping
function in our Bayesian framework, described in the next section.

3.1 Color Statistics 

In RGB color space, important color statistics can often be revealed through
the shape of a color histogram. Thus, the histogram can be used to establish
an explicit connection between $I_H$ and $I_L$. Moreover, since high irradiance always
generates brighter pixels [7], the color statistics in $I_L$ and $I_H$ can be matched
in order from lower to higher in pixel intensity values. Accordingly, we want to
reshape the histogram of $I_L$, say $h_{I_L}$, such that:

$$g(h_{I_L}) = h_{I_H} \tag{1}$$

where $g(\cdot)$ is the transformation function performed on each color value in the
histogram, and $h_{I_H}$ is the histogram of $I_H$. A common method to estimate $g(\cdot)$
is adaptive histogram equalization, which normally modifies the dynamic range
and contrast of an image according to a destination curve.

Unfortunately, this histogram equalization does not produce satisfactory re- 
sults. The quantized 256 (single byte accuracy) colors in each channel are not 
sufficient to accurately model the variety of histogram shapes. Hence, we adopt 
the following method to optimally estimate the transformation function: 

1. Convert the image from RGB space to a perception-based color space $l\alpha\beta$ [4],
where $l$ is the achromatic channel and $\alpha$ and $\beta$ contain the chromaticity
values. In this way, the image is transformed to a more discrete space with
known phosphor chromaticity.

2. Accordingly, we cluster the color distributions in the new color space into 
65536 (double byte precision) bins, and perform histogram equalization. 

3. Finally, we transform the result back to the RGB space. 

By performing this transformed histogram equalization , we relate the two 
images entirely in their color space. 
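The order-preserving reshaping of one histogram toward another, as in (1), can be illustrated with generic CDF matching on a single integer-valued channel. This is a simplified stand-in, not the transformed histogram equalization described above: it skips the color space conversion, and the names `cdf` and `match_histogram` and the `bins` parameter are assumptions made for the example.

```python
from bisect import bisect_left
from itertools import accumulate

def cdf(values, bins):
    """Cumulative distribution of integer values in range(bins)."""
    hist = [0] * bins
    for v in values:
        hist[v] += 1
    total = float(len(values))
    return [c / total for c in accumulate(hist)]

def match_histogram(src, ref, bins=256):
    """Lookup table g mapping source levels so that their distribution
    approximates that of the reference.

    src, ref: sequences of integer levels in range(bins), e.g. one channel
              of the under-exposed and of the normally exposed image.
    """
    cs, cr = cdf(src, bins), cdf(ref, bins)
    # For each source level, take the smallest reference level whose
    # cumulative mass reaches that of the source level (monotone mapping).
    return [min(bisect_left(cr, c), bins - 1) for c in cs]
```

The table is applied per pixel as `g[v]`. Because it is monotone, darker source levels never map above brighter ones, in line with matching the statistics in order from lower to higher intensity; raising `bins` to 65536 mirrors the finer quantization mentioned in step 2.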




